Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.
In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper noun. For example, in the sentence “Biden is the president of the United States”, “Biden” and “the United States” are named entities (proper nouns), while “president” is not a named entity.
Spacy’s Token Matcher lets you prepare spaCy rules at the token level, invoking all the complex token attributes (LOWER, IS_PUNCT, TEXT, regex-based conditions, etc.) and operators.
Example 1 from Spacy Documentation
patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],  # captures any case variant of "hello-world", "hello!world"
    [{"LOWER": "hello"}, {"LOWER": "world"}],                      # captures any case variant of "hello world"
]
Example 2 using Token-level Regex
pattern = [
    {"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}},
    {"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}},
    {"LOWER": "president"},
]  # captures (U.S or US or United States or united states) President
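For completeness, here is a sketch of how the token-regex pattern above can be wired into a `Matcher` (the sentence and the rule name `US_PRESIDENT` are illustrative):

```python
import spacy
from spacy.matcher import Matcher

# a blank English pipeline is enough, since the pattern only uses TEXT/LOWER
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"TEXT": {"REGEX": "^[Uu](\\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\\.?|tates)$"}},
           {"LOWER": "president"}]
matcher.add("US_PRESIDENT", [pattern])

doc = nlp("The United States President lives in the White House.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```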
1B. About Phrase Matcher
If we have a huge list of phrases in a Python list or in a csv file, the Phrase Matcher can be applied to them directly.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)
1C. Explaining Token and Phrase Matchers with a MODEL_NAMES NER Capture
Code
# goal: to capture Car Models to their corresponding Makes
sample_sentence_1 = "Go for ford mustang mache variants. Mustang has a deceivingly huge trunk and good horse power. If you want reliability, Toyota Lexus should be your choice. Lexus has good body style too"
sample_sentence_2 = "Considering the harsh winters here, I am considering 2014 Nissan Murano or the '14 Subaru Forester"
sample_sentence_3 = "Among used cars, I am still not sure what to choose - Civic or Corolla?"
sample_models_sentences = [sample_sentence_1, sample_sentence_2, sample_sentence_3]
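The notebook's full matching code is not reproduced here; a minimal sketch of the idea, using a hypothetical model-to-make lookup and a case-insensitive `PhraseMatcher`, could look like this:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# hypothetical lookup; a real system would load a full models-to-makes table
model_to_make = {
    "mustang": "FORD", "mustang mache": "FORD",
    "lexus": "TOYOTA", "corolla": "TOYOTA",
    "civic": "HONDA", "murano": "NISSAN", "forester": "SUBARU",
}

# attr="LOWER" makes the phrase matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for model, make in model_to_make.items():
    matcher.add(make, [nlp.make_doc(model)])

doc = nlp("Among used cars, I am still not sure what to choose - Civic or Corolla?")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text, "->", nlp.vocab.strings[match_id])
```

Each match id resolves back to the make it was registered under, which is how the annotated outputs below pair every model mention with its make.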
Processing Sentence 0
Go for ford mustang mache [FORD] variants. Mustang [FORD] has a deceivingly huge trunk and good horse power. If you want reliability, Toyota Lexus [TOYOTA] should be your choice. Lexus [TOYOTA] has good body style too
***************
Processing Sentence 1
Considering the harsh winters here, I am considering 2014 Nissan Murano [NISSAN] or the '14 Subaru Forester [SUBARU]
***************
Processing Sentence 2
Among used cars, I am still not sure what to choose - Civic [HONDA] or Corolla [TOYOTA]?
***************
2. Externalizing Rules from Code
2A. Saving spacy rules in a json format
Note that the official spaCy documentation advocates the jsonl format, but json is much more readable for multi-token spaCy patterns
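As a sketch of the externalizing step (the pattern shown is a simplified, hypothetical PHONE rule, not the notebook's actual `phone_patterns.json` content), the patterns can be dumped with the standard `json` module:

```python
import json

# hypothetical multi-token PHONE pattern in entity-ruler format:
# an optional-parentheses area code, then two digit groups
spacy_patterns_dict_list = [
    {"label": "PHONE",
     "pattern": [{"TEXT": {"REGEX": "^\\(?\\d{3}\\)?$"}},
                 {"TEXT": {"REGEX": "^\\d{3}$"}},
                 {"TEXT": {"REGEX": "^\\d{4}$"}}]},
]

with open("phone_patterns.json", "w", encoding="utf-8") as f:
    json.dump(spacy_patterns_dict_list, f, indent=4)  # indent keeps the file readable
```

The same file can then be reloaded and asserted against the in-memory patterns, as done below.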
2B. Prepare an Entity Ruler loading rules from a json
loaded_spacy_patterns = json.load(open('spacy_rules_ner/phone_patterns.json', 'r', encoding='utf-8'))
print("Asserting if the loaded spacy patterns are same as the prepared")
assert loaded_spacy_patterns == spacy_patterns_dict_list
print("--> Assertion successful")
Asserting if the loaded spacy patterns are same as the prepared
--> Assertion successful
phone_nlp = spacy.load('en_core_web_sm', disable=['ner'])
rules_config = {
    "validate": True,
    "overwrite_ents": True,
}
phone_nlp_rules = phone_nlp.add_pipe("entity_ruler",     # invoke entity_ruler pipe
                                     "phone_nlp_rules",  # give a name to the pipe
                                     config=rules_config)
phone_nlp_rules.add_patterns(loaded_spacy_patterns)  # load patterns to the `phone_nlp_rules` pipe of `phone_nlp` model
2C. Testing on Sample Phone Sentences
Code
phones_list = ["+1 (901)-985-4567", "+1(901)-985-4567", "(901) 985 4567", "9019854567"]
sample_phone_sentences = [f"If you want to talk more. Reach me at {each}" for each in phones_list]
sample_phone_sentences
['If you want to talk more. Reach me at +1 (901)-985-4567',
'If you want to talk more. Reach me at +1(901)-985-4567',
'If you want to talk more. Reach me at (901) 985 4567',
'If you want to talk more. Reach me at 9019854567']
for i, doc in enumerate(phone_nlp.pipe(sample_phone_sentences, as_tuples=False)):
    print(f"\033[1mProcessing Sentence {i}\033[0m")
    displacy.render(doc, style='ent')
Processing Sentence 0
If you want to talk more. Reach me at +1 (901)-985-4567 [PHONE]
Processing Sentence 1
If you want to talk more. Reach me at +1(901)-985-4567 [PHONE]
Processing Sentence 2
If you want to talk more. Reach me at (901) 985 4567 [PHONE]
Processing Sentence 3
If you want to talk more. Reach me at 9019854567 [PHONE]
By adding the custom component, we are also able to capture the sentences that the rules alone missed.
3B. Chaining Spacy NER components
Chaining spaCy NER components makes the patterns more manageable. It is similar to modular programming, but for building complex spaCy NER rules.
Let us discuss creating a first-degree NER plus a second-degree NER (chaining NER) written on top of the first-degree NER entities.
It is better to train a model for an ADDRESS entity. But for the sake of explaining the chaining NER technique, let us build an ADDRESS NER using spaCy rules.
Code
# Goal: To capture the different types of ADDRESSES in the following ADDRESS SENTENCES
sample_address_sentences = [
    'My office is located at 1 American rd, Dearborn, MI 48126, United States',
    'My office is located at one American Road Dearborb Michigan 48126 United States',
    'My office is located at 1 American High way, South Dakota 48126, United States',
    'My office is located at 1 American rd, Dearborn, MI, United States',
    'My office is located at 717 N 2ND ST, MANKATO, MN 56001 US',
    'My office is located at 717 N 2ND ST, MANKATO, MN, 56001',
    'My office is located at 717 N 2ND ST MANKATO MN 56001',
    'My office is located at Dearborn Michigan',
    'My office is located at Chennai, TamilNadu',
    'My office is located at Dearborn, Michigan',
    'My office is located at PO Box 107050, Albany, NY 12201-7050',
    'My office is located at PO Box 107050, Albany, NY 12201',
    'My office is located at P.O. Box 107050, Albany, NY 12201-7050',
    'My office is located at P.O. Box 107050, Albany, NY 12201',
]
Code
# Capture following first degree NER entities
# `DOOR_NUM` entity
# `STREET_NAME` entity
# `CITY`
# `STATE`
# `COUNTRY`
# `ZIP_CODE`
# `P_O_BOX`
# one or more of the above 1st degree NER entities form the final ADDRESS entity
door_num_spacy_pattern = [
    {"label": "POTENTIAL_DOOR_NUM",
     "pattern": [{"LOWER": {"REGEX": "\\b([0-9]{1,4}|one|two|three|four|five|six|seven|eight|nine|ten)\\b"}}]}
]
street_spacy_pattern = [
    {"label": "POTENTIAL_STREET_NAME",
     "pattern": [{"TEXT": {"REGEX": "^(N|S|E|W)$"}, "OP": "?"},
                 {"TEXT": {"REGEX": "^[A-Z][a-zA-Z]+$|\d(st|nd|rd|th|ST|ND|RD|TH)|[Ff]irst|[Ss]econd|[Tt]hird|[Ff]ourth|[Ff]ifth|[Ss]ixth|[Ss]eventh|[Ee]ighth|[Nn]inth|[Tt]enth"}, "OP": "+"},
                 {"LOWER": {"REGEX": "\\b(street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd|high|park|way|cross)\\b"}, "OP": "+"}]}
]
city_or_country_spacy_pattern = [
    {"label": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME",
     "pattern": [{"TEXT": {"REGEX": "^[A-Z][a-zA-Z]+$"}, "OP": "+",
                  "TAG": {"REGEX": "^NN|HYPH"}}]}
]
zip_code_spacy_pattern = [
    {"label": "ZIP_CODE",
     "pattern": [{"TEXT": {"REGEX": "\\b\\d{5}\\b"}},
                 {"ORTH": "-", "OP": "?"},
                 {"TEXT": {"REGEX": "\\b\\d+\\b"}, "OP": "?"}]}
]
p_o_box_pattern = [
    {"label": "P_O_BOX",
     "pattern": [{"LOWER": {"IN": ["po", "p.o", "p.o.", "post"]}},
                 {"LOWER": {"IN": ["office", "."]}, "OP": "?"},
                 {"LOWER": {"IN": ["box"]}},
                 {"TEXT": {"REGEX": "\\b\\d+\\b"}}]}
]
first_degree_address_patterns = (door_num_spacy_pattern + street_spacy_pattern
                                 + city_or_country_spacy_pattern
                                 + p_o_box_pattern + zip_code_spacy_pattern)

def create_first_degree_address_nlp(first_degree_address_patterns):
    first_degree_address_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }
    first_degree_rules = first_degree_address_nlp.add_pipe("entity_ruler",
                                                           "first_degree_rules",
                                                           config=rules_config)
    first_degree_rules.add_patterns(first_degree_address_patterns)
    return first_degree_address_nlp
Processing Sentence 0
My office is located at 1 [POTENTIAL_DOOR_NUM] American rd [POTENTIAL_STREET_NAME], Dearborn [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], MI [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 48126 [ZIP_CODE], United States [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 1
My office is located at one [POTENTIAL_DOOR_NUM] American Road Dearborb Michigan [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 48126 [ZIP_CODE] United States [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 2
My office is located at 1 [POTENTIAL_DOOR_NUM] American High way [POTENTIAL_STREET_NAME], South Dakota [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 48126 [ZIP_CODE], United States [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 3
My office is located at 1 [POTENTIAL_DOOR_NUM] American rd [POTENTIAL_STREET_NAME], Dearborn [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], MI [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], United States [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 4
My office is located at 717 [POTENTIAL_DOOR_NUM] N 2ND ST [POTENTIAL_STREET_NAME], MANKATO [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], MN [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 56001 [ZIP_CODE] US [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 5
My office is located at 717 [POTENTIAL_DOOR_NUM] N 2ND ST [POTENTIAL_STREET_NAME], MANKATO [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], MN [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], 56001 [ZIP_CODE]
Processing Sentence 6
My office is located at 717 [POTENTIAL_DOOR_NUM] N 2ND ST [POTENTIAL_STREET_NAME] MANKATO MN [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 56001 [ZIP_CODE]
Processing Sentence 7
My office is located at Dearborn Michigan [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 8
My office is located at Chennai [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], TamilNadu [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 9
My office is located at Dearborn [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], Michigan [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME]
Processing Sentence 10
My office is located at PO Box 107050 [P_O_BOX], Albany [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], NY [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 12201-7050 [ZIP_CODE]
Processing Sentence 11
My office is located at PO Box 107050 [P_O_BOX], Albany [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], NY [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 12201 [ZIP_CODE]
Processing Sentence 12
My office is located at P.O. Box 107050 [P_O_BOX], Albany [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], NY [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 12201-7050 [ZIP_CODE]
Processing Sentence 13
My office is located at P.O. Box 107050 [P_O_BOX], Albany [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME], NY [POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME] 12201 [ZIP_CODE]
# pattern for: 1 American rd, Dearborn, MI 48126, United States
# pattern for: 1 American High way, South Dakota 48126, United States
# pattern for: 1 American rd, Dearborn, MI, United States
# pattern for: 717 N 2ND ST, MANKATO, MN 56001 US
ADDRESS_PATTERN_1 = [
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "POTENTIAL_DOOR_NUM"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_STREET_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "*"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "*"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "ZIP_CODE", "OP": "*"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"}]},
]
# pattern for: one American Road Dearborb Michigan 48126 United States
ADDRESS_PATTERN_2 = [
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "POTENTIAL_DOOR_NUM"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "ZIP_CODE", "OP": "*"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"}]},
]
# 717 N 2ND ST, MANKATO, MN, 56001
# 717 N 2ND ST MANKATO MN 56001
ADDRESS_PATTERN_3 = [
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "POTENTIAL_DOOR_NUM"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_STREET_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "ZIP_CODE", "OP": "+"}]}
]
# Chennai, TamilNadu
# Dearborn Michigan
ADDRESS_PATTERN_4 = [
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"}]}
]
# PO Box 107050, Albany, NY 12201-7050
# PO Box 107050, Albany, NY 12201
# PO Box 107050, Albany, NY, US 12201
ADDRESS_PATTERN_5 = [
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "P_O_BOX", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "+"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME", "OP": "?"},
                 {"TEXT": {"REGEX": "\\W{1,2}"}, "OP": "?"},
                 {"ENT_TYPE": "ZIP_CODE", "OP": "+"}]}
]
second_degree_address_patterns = (ADDRESS_PATTERN_1 + ADDRESS_PATTERN_2
                                  + ADDRESS_PATTERN_3 + ADDRESS_PATTERN_4
                                  + ADDRESS_PATTERN_5)

def create_second_degree_address_nlp(first_degree_address_patterns,
                                     second_degree_address_patterns):
    second_degree_address_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }
    first_degree_rules = second_degree_address_nlp.add_pipe("entity_ruler",
                                                            "first_degree_rules",
                                                            config=rules_config)
    first_degree_rules.add_patterns(first_degree_address_patterns)
    second_degree_rules = second_degree_address_nlp.add_pipe("entity_ruler",
                                                             "second_degree_rules",
                                                             config=rules_config,
                                                             after='first_degree_rules')
    second_degree_rules.add_patterns(second_degree_address_patterns)
    return second_degree_address_nlp
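Stripped of the address details, the chaining mechanism itself can be sketched with toy rules (the CITY/STATE patterns below are hypothetical, not the address rules above): the second ruler matches on the `ENT_TYPE` values that the first ruler assigns.

```python
import spacy

nlp = spacy.blank("en")
config = {"validate": True, "overwrite_ents": True}

# first-degree ruler: label raw tokens
first = nlp.add_pipe("entity_ruler", name="first_degree_rules", config=config)
first.add_patterns([
    {"label": "CITY", "pattern": [{"LOWER": "chennai"}]},
    {"label": "STATE", "pattern": [{"LOWER": "tamilnadu"}]},
])

# second-degree ruler: match on the ENT_TYPEs set by the first ruler
second = nlp.add_pipe("entity_ruler", name="second_degree_rules",
                      config=config, after="first_degree_rules")
second.add_patterns([
    {"label": "ADDRESS",
     "pattern": [{"ENT_TYPE": "CITY"}, {"ORTH": ","}, {"ENT_TYPE": "STATE"}]},
])

doc = nlp("My office is located at Chennai, TamilNadu")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because `overwrite_ents` is True, the second-degree ADDRESS span replaces the overlapping first-degree CITY and STATE spans.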
%%writefile spacy_rules_ner/phone_functions.py
import json
import re
import spacy
from spacy import Language
from spacy.tokens import Span
from spacy.util import filter_spans

def load_entity_ruler_based_phone_pattern(location_spacy_json):
    loaded_spacy_patterns = json.load(open(location_spacy_json, 'r', encoding='utf-8'))
    phone_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }
    phone_nlp_rules = phone_nlp.add_pipe("entity_ruler",     # invoke entity_ruler pipe
                                         "phone_nlp_rules",  # give a name to the pipe
                                         config=rules_config)
    phone_nlp_rules.add_patterns(loaded_spacy_patterns)  # load patterns to the `phone_nlp_rules` pipe of `phone_nlp` model
    return phone_nlp

location_spacy_json = 'spacy_rules_ner/phone_patterns.json'
phone_nlp = load_entity_ruler_based_phone_pattern(location_spacy_json)
print("Pipeline Components before adding regex custom component:")
print(phone_nlp.pipe_names)
print()
print("Entities tracked in phone_nlp_rules")
print(phone_nlp.pipe_labels['phone_nlp_rules'])

phone_regex_pattern = r"([+]?[\d]?[\d]?.?[(]?\d{3}[)]?.?\d{3}.?\d{4})"

# noting the token position for every character in a doc
def generate_chars2tokens_dict(doc):
    chars_to_tokens = {}
    for token in doc:
        for i in range(token.idx, token.idx + len(token.text)):
            chars_to_tokens[i] = token.i
    return chars_to_tokens

@Language.component("phone_multitoken_regex_capture")
def phone_multitoken_regex_capture(doc):
    original_ents = list(doc.ents)
    chars_to_tokens = generate_chars2tokens_dict(doc)
    phone_regex_ents = []
    for match in re.finditer(phone_regex_pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            phone_regex_ents.append((span.start, span.end, span.text))
        else:
            start_token = chars_to_tokens.get(start)
            end_token = chars_to_tokens.get(end)
            if start_token is not None and end_token is not None:
                span = doc[start_token:end_token + 1]
                phone_regex_ents.append((span.start, span.end, span.text))
    for regex_ent in phone_regex_ents:
        start_char, end_char, span_text = regex_ent
        proper_spacy_ent = Span(doc, start_char, end_char, label="PHONE")
        original_ents.append(proper_spacy_ent)
    filtered = filter_spans(original_ents)  # removes overlapping ents
    doc.ents = filtered
    return doc

phone_nlp.add_pipe("phone_multitoken_regex_capture", after="phone_nlp_rules")
print("Pipeline Components after adding regex custom component:")
print(phone_nlp.pipe_names)
# inspiration for the above code piece:
# https://spacy.io/usage/rule-based-matching#regex-text
Overwriting spacy_rules_ner/phone_functions.py
# create the output_dir
!mkdir -p ./spacy_rules_ner/packaged_phone_nlp
# now let us package the `phone_nlp`
!python3 -m spacy package ./spacy_rules_ner/phone_nlp ./spacy_rules_ner/packaged_phone_nlp --code spacy_rules_ner/phone_functions.py --name phone_nlp_2