How to Leverage spaCy's Rule-Based NER

This blog post outlines the important features of spaCy's rule-based NER.
Categories: NLP, Coding, Python
Author: Senthil Kumar
Published: May 9, 2021

Introduction to spaCy and NER

About spaCy

  • spaCy is an NLP library offering an easy-to-use Python API for many information extraction and machine learning tasks on text data
  • It is written internally in Cython, so its small models occupy a low memory footprint and run quite fast with decent accuracy

Source: more about spaCy

What is NER?

Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories.

Source: Wikipedia Article on NER

In information extraction, a named entity is a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper noun.
For example, in the sentence “Biden is the president of the United States”,
“Biden” and “the United States” are named entities (proper nouns), while “president” is not a named entity.

Source: Wikipedia Article on Named Entities
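
To see this in code, here is a minimal sketch of spaCy's statistical NER on that example (the exact spans and labels depend on the model; en_core_web_sm is assumed):

Code
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Biden is the president of the United States")
for ent in doc.ents:
    print(ent.text, ent.label_)

# expected output along these lines (small models may vary):
# Biden PERSON
# the United States GPE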

The spaCy version used here

Code
!python3 -m spacy validate
✔ Loaded compatibility table

================= Installed pipeline packages (spaCy v3.1.3) =================
ℹ spaCy installation: /usr/local/lib/python3.7/dist-packages/spacy

NAME             SPACY            VERSION                            
en_core_web_sm   >=3.1.0,<3.2.0   3.1.0   ✔

1. Basic Features of spaCy Rule-Based NER

1A. About the Token Matcher

  • spaCy’s Token Matcher lets you write rules at the token level, combining token attributes, operators, and regular expressions for complex matching
  • Example 1 from the spaCy documentation
patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}], # captures any case variant of "hello-world", "hello!world"
    [{"LOWER": "hello"}, {"LOWER": "world"}] # captures any case variant of "hello world"
]
  • Example 2 using token-level regex (a runnable sketch follows below)
pattern = [{"TEXT": {"REGEX": "^[Uu](\.?|nited)$"}},
           {"TEXT": {"REGEX": "^[Ss](\.?|tates)$"}},
           {"LOWER": "president"}] # captures any case variant of "U.S. president", "US president", "United States president"

1B. About the Phrase Matcher

  • If we have a huge list of phrases in a Python list or a CSV file, the PhraseMatcher can be applied to it directly
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."]
# Only run nlp.make_doc to speed things up
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TerminologyList", patterns)

doc = nlp("German Chancellor Angela Merkel and US President Barack Obama "
          "converse in the Oval Office inside the White House in Washington, D.C.")
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

Source: https://spacy.io/usage/rule-based-matching#adding-phrase-patterns
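
Running the snippet above should print the three matched spans in document order: Angela Merkel, Barack Obama, and Washington, D.C.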

1C. Explaining Token and Phrase Matchers with a MODEL_NAMES NER Capture

Code
# goal: capture car models and map them to their corresponding makes
sample_sentence_1 = "Go for ford mustang mache variants. Mustang has a deceivingly huge trunk and good horse power. If you want reliability, Toyota Lexus should be your choice. Lexus has good body style too"
sample_sentence_2 = "Considering the harsh winters here, I am considering 2014 Nissan Murano or the '14 Subaru Forester"
sample_sentence_3 = "Among used cars, I am still not sure what to choose - Civic or Corolla?"

sample_models_sentences = [sample_sentence_1, 
                    sample_sentence_2,
                    sample_sentence_3
                   ]
Code
import spacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans
from spacy import displacy

model_names_nlp = spacy.load('en_core_web_sm', disable=['ner'])
matcher = Matcher(model_names_nlp.vocab)

# token-level pattern rules (one dict per token)
ford_pattern = [{"LOWER": "ford", "OP":"?"},
                   {"LOWER": "mustang"},
                   {"LOWER":{"IN":["mache","gt","bulitt"]},"OP":"*"}
                  ]
toyota_pattern = [{"LOWER": "toyota","OP":"?"},
                 {"LOWER": {"IN":["lexus","corolla","camry"]}}
                ]

honda_pattern = [{"LOWER": "honda","OP":"?"},
                 {"LOWER": {"IN":["civic","accord"]}}
                ]

token_matcher_patterns = {"FORD": ford_pattern,
                          "TOYOTA": toyota_pattern,
                          "HONDA": honda_pattern,
                         }

# phrase patterns look for exact matches
nissan_phrase_pattern = ["Nissan Murano", "Murano", "murano", "nissan murano"]
subaru_phrase_pattern = ["Subaru Forester", "Forester", "forester", "subaru forester"]

phrase_matcher_patterns = {"NISSAN": nissan_phrase_pattern,
                           "SUBARU": subaru_phrase_pattern
                          }

def add_token_matcher_and_phrase_matcher_patterns(nlp_model,
                                                  token_patterns_dict=token_matcher_patterns,
                                                  phrase_patterns_dict=phrase_matcher_patterns
                                                 ):
    token_matcher = Matcher(nlp_model.vocab)
    for key, value in token_patterns_dict.items():
        token_matcher.add(key,[value])
        
    phrase_matcher = PhraseMatcher(nlp_model.vocab)
    for key, terms_list in phrase_patterns_dict.items():
        phrase_patterns = [nlp_model.make_doc(text) for text in terms_list]
        phrase_matcher.add(key, phrase_patterns)
    return token_matcher, phrase_matcher

doc = model_names_nlp(sample_sentence_1)

def modify_doc(token_matcher,
               phrase_matcher,
               doc):
    original_ents = list(doc.ents)
    matches = token_matcher(doc) + phrase_matcher(doc)
    for match_id, start, end in matches:
        span = Span(doc, start, end, match_id)
        original_ents.append(span)
    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return doc
Code
token_matcher, phrase_matcher = add_token_matcher_and_phrase_matcher_patterns(model_names_nlp, 
                                                                              token_matcher_patterns,
                                                                              phrase_matcher_patterns
                                                                             )
modelnames_dict = {
    "HONDA": "#ffcccb",  # light red (pink)
    "TOYOTA": "#A865C9",  # light purple
    "NISSAN": "#FFD580",  # light orange
    "SUBARU": "#90EE90",  # light green
    "FORD": "#ADD8E6"  # light blue
}

models = list(modelnames_dict.keys())

options_dict = {"ents": models,
           "colors": modelnames_dict
          }



for i, doc in enumerate(model_names_nlp.pipe(sample_models_sentences,
             as_tuples=False
            )):
    new_doc = modify_doc(token_matcher,
                         phrase_matcher,
                         doc
                        )
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(new_doc,
                    style='ent',
                    options=options_dict,
                    minify=True
                   )
    print("***************")
Processing Sentence 0 
Go for ford mustang mache FORD variants. Mustang FORD has a deceivingly huge trunk and good horse power. If you want reliability, Toyota Lexus TOYOTA should be your choice. Lexus TOYOTA has good body style too
***************
Processing Sentence 1 
Considering the harsh winters here, I am considering 2014 Nissan Murano NISSAN or the '14 Subaru Forester SUBARU
***************
Processing Sentence 2 
Among used cars, I am still not sure what to choose - Civic HONDA or Corolla TOYOTA?
***************

2. Externalizing Rules from Code

2A. Saving spaCy rules in JSON format

Note that the official spaCy documentation advocates the JSONL format, but JSON is much more readable for multi-token spaCy patterns (a JSONL alternative is sketched after the following code cell).

Code
import json 
from pprint import pprint

# "+1 (901)-985-4567" or "(901)-985-4567" or 901-985-4567 or 901 985 4567
phone_pattern_1 = [{"ORTH": {"IN":["+1","1"]},'OP':'?'},
                   {"ORTH": '(', "OP":"?"}, 
                   {'SHAPE': 'ddd'}, 
                   {"ORTH": ')', "OP":"?"}, 
                   {'ORTH': '-', 'OP':'?'},
                   {'SHAPE': 'ddd'}, 
                   {'ORTH': '-', 'OP':'?'}, 
                   {'SHAPE': 'dddd'}]

# 9019854567 (a single 10-digit token)
phone_pattern_2 = [{"TEXT": {"REGEX": "\d{10}"}}]

# +19019854567
phone_pattern_3 = [{"TEXT": {"REGEX": "\+1\d{10}"}}]

phone_patterns_list = [phone_pattern_1, phone_pattern_2, phone_pattern_3]

spacy_patterns_dict_list = []

for each_phone_pattern in phone_patterns_list:
    spacy_patterns_dict_list.append({"label":"PHONE",
                                     "pattern": each_phone_pattern
                                    })

with open('spacy_rules_ner/phone_patterns.json', 'w', encoding='utf-8') as f:
    json.dump(spacy_patterns_dict_list, f, ensure_ascii=False, indent=1)
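
If you prefer the JSONL format that the official documentation advocates, the same patterns can also be written one-per-line with srsly, the serialization library that ships with spaCy. A minimal sketch (the .jsonl path is illustrative):

Code
import srsly

# write one pattern dict per line
srsly.write_jsonl('spacy_rules_ner/phone_patterns.jsonl', spacy_patterns_dict_list)

# an entity_ruler pipe can read a JSONL patterns file directly:
# ruler = nlp.add_pipe("entity_ruler")
# ruler.from_disk('spacy_rules_ner/phone_patterns.jsonl')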

2B. Preparing an Entity Ruler by loading rules from JSON

loaded_spacy_patterns = json.load(open('spacy_rules_ner/phone_patterns.json','r',encoding='utf-8'))

print("Asserting if the loaded spacy patterns are same as the prepared")
assert loaded_spacy_patterns == spacy_patterns_dict_list
print("--> Assertion successful")
Asserting if the loaded spacy patterns are same as the prepared
--> Assertion successful
print("Token-level Spacy Phone Regex")
pprint(loaded_spacy_patterns)
Token-level Spacy Phone Regex
[{'label': 'PHONE',
  'pattern': [{'OP': '?', 'ORTH': {'IN': ['+1', '1']}},
              {'OP': '?', 'ORTH': '('},
              {'SHAPE': 'ddd'},
              {'OP': '?', 'ORTH': ')'},
              {'OP': '?', 'ORTH': '-'},
              {'SHAPE': 'ddd'},
              {'OP': '?', 'ORTH': '-'},
              {'SHAPE': 'dddd'}]},
 {'label': 'PHONE', 'pattern': [{'TEXT': {'REGEX': '\\d{10}'}}]},
 {'label': 'PHONE', 'pattern': [{'TEXT': {'REGEX': '\\+1\\d{10}'}}]}]
phone_nlp = spacy.load('en_core_web_sm', disable=['ner'])
rules_config = {
    "validate": True,
    "overwrite_ents": True,
}

phone_nlp_rules = phone_nlp.add_pipe("entity_ruler", # invoke entity_ruler pipe 
                                     "phone_nlp_rules", # give a name to the pipe
                                     config=rules_config)
phone_nlp_rules.add_patterns(loaded_spacy_patterns) # load patterns to the `phone_nlp_rules` pipe of `phone_nlp` model

2C. Testing on Sample Phone Sentences

Code
phones_list = ["+1 (901)-985-4567", 
               "+1(901)-985-4567",
               "(901) 985 4567",
               "9019854567"
              ]



sample_phone_sentences = [f"If you want to talk more. Reach me at {each}" for each in phones_list]
sample_phone_sentences
['If you want to talk more. Reach me at +1 (901)-985-4567',
 'If you want to talk more. Reach me at +1(901)-985-4567',
 'If you want to talk more. Reach me at (901) 985 4567',
 'If you want to talk more. Reach me at 9019854567']
Code
import warnings
warnings.filterwarnings('ignore')

for i, doc in enumerate(phone_nlp.pipe(sample_phone_sentences,
             as_tuples=False
            )):
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(doc,
                    style='ent',
                   )
Processing Sentence 0 
If you want to talk more. Reach me at +1 (901)-985-4567
Processing Sentence 1 
If you want to talk more. Reach me at +1(901)-985-4567
Processing Sentence 2 
If you want to talk more. Reach me at (901) 985 4567 PHONE
Processing Sentence 3 
If you want to talk more. Reach me at 9019854567 PHONE

Some of the phone numbers are not captured by the above patterns; as the token listing below shows, the tokenizer keeps chunks like “901)-985” as a single token, so let us add a phone regex that cuts across multiple tokens.

text = "If you want to talk more. Reach me at +1 (901)-985-4567"
doc = phone_nlp(text)
[(token.text, token.ent_type_) for token in doc]
[('If', ''),
 ('you', ''),
 ('want', ''),
 ('to', ''),
 ('talk', ''),
 ('more', ''),
 ('.', ''),
 ('Reach', ''),
 ('me', ''),
 ('at', ''),
 ('+1', ''),
 ('(', ''),
 ('901)-985', ''),
 ('-', ''),
 ('4567', '')]
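
As an aside, before reaching for a custom component: in spaCy v3, Doc.char_span also accepts an alignment_mode argument that can snap mis-aligned character offsets to token boundaries. A minimal sketch reusing the text and doc variables from the cell above:

Code
start = text.find("901")

# "901" ends in the middle of the token "901)-985", so the default
# alignment_mode="strict" returns None ...
print(doc.char_span(start, start + 3))                           # None

# ... while "expand" grows the span outward to the token boundaries
print(doc.char_span(start, start + 3, alignment_mode="expand"))  # 901)-985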

3. Advanced Features of spaCy Rule-Based NER

3A. Adding RegEx patterns as a custom component

Code
import re 
from spacy import Language
from spacy.tokens import Span

def load_entity_ruler_based_phone_pattern(location_spacy_json):
    loaded_spacy_patterns = json.load(open(location_spacy_json,'r',encoding='utf-8'))
    phone_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }

    phone_nlp_rules = phone_nlp.add_pipe("entity_ruler", # invoke entity_ruler pipe 
                                         "phone_nlp_rules", # give a name to the pipe
                                         config=rules_config)
    phone_nlp_rules.add_patterns(loaded_spacy_patterns) # load patterns to the `phone_nlp_rules` pipe of `phone_nlp` model
    return phone_nlp
       
location_spacy_json = 'spacy_rules_ner/phone_patterns.json'

phone_nlp = load_entity_ruler_based_phone_pattern(location_spacy_json)

print("Pipeline Components before adding regex custom component:")
print(phone_nlp.pipe_names)
print()
print("Entities tracked in phone_nlp_rules")
print(phone_nlp.pipe_labels['phone_nlp_rules'])

phone_regex_pattern = r"([+]?[\d]?[\d]?.?[(]?\d{3}[)]?.?\d{3}.?\d{4})"

# map every character offset in a doc to the index of the token containing it

def generate_chars2tokens_dict(doc):
    chars_to_tokens = {}
    for token in doc:
        for i in range(token.idx, token.idx + len(token.text)):
            chars_to_tokens[i] = token.i
    return chars_to_tokens


@Language.component("phone_multitoken_regex_capture")
def phone_multitoken_regex_capture(doc):
    original_ents = list(doc.ents)
    chars_to_tokens = generate_chars2tokens_dict(doc)
    phone_regex_ents = []
    for match in re.finditer(phone_regex_pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            phone_regex_ents.append((span.start, span.end, span.text))
        else:
            start_token = chars_to_tokens.get(start)
            end_token = chars_to_tokens.get(end)
            if start_token is not None and end_token is not None:
                span = doc[start_token:end_token + 1]
                phone_regex_ents.append((span.start, span.end, span.text))
    for regex_ent in phone_regex_ents:
        start_token, end_token, span_text = regex_ent  # token indices, not character offsets
        proper_spacy_ent = Span(doc, start_token, end_token, label="PHONE")
        original_ents.append(proper_spacy_ent)     
    filtered = filter_spans(original_ents) #removes overlapping ents
    doc.ents = filtered
    return doc


phone_nlp.add_pipe("phone_multitoken_regex_capture", after="phone_nlp_rules")

print("Pipeline Components after adding regex custom component:")
print(phone_nlp.pipe_names)

# inspiration for the above code piece: 
# https://spacy.io/usage/rule-based-matching#regex-text
Pipeline Components before adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules']

Entities tracked in phone_nlp_rules
['PHONE']
Pipeline Components after adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules', 'phone_multitoken_regex_capture']
for i, doc in enumerate(phone_nlp.pipe(sample_phone_sentences,
             as_tuples=False
            )):
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(doc,
                    style='ent',
                   )
Processing Sentence 0 
If you want to talk more. Reach me at +1 (901)-985-4567 PHONE
Processing Sentence 1 
If you want to talk more. Reach me at +1(901)-985-4567 PHONE
Processing Sentence 2 
If you want to talk more. Reach me at (901) 985 4567 PHONE
Processing Sentence 3 
If you want to talk more. Reach me at 9019854567 PHONE

By adding the custom component, we are able to capture the previously missed phone numbers as well.

3B. Chaining spaCy NER components

Chaining spaCy NER components makes the patterns more manageable.
It is similar to modular programming, but for building complex spaCy NER rules.

Let us discuss creating a first-degree NER plus a second-degree NER (a chained NER) written on top of the first-degree NER entities.

Ideally, one would train a statistical model for an ADDRESS entity.
But for the sake of explaining the chaining technique, let us build an ADDRESS NER using spaCy rules, starting with the conceptual sketch below.
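
The key idea is that the second-degree EntityRuler matches on the ENT_TYPE attribute that the first-degree ruler has already assigned. A minimal conceptual sketch (CITY and STATE are hypothetical first-degree labels, not the ones used below):

Code
# hypothetical second-degree pattern: a CITY entity, an optional comma,
# then a STATE entity, rolled up into a single ADDRESS entity
chained_address_pattern = {"label": "ADDRESS",
                           "pattern": [{"ENT_TYPE": "CITY"},
                                       {"ORTH": ",", "OP": "?"},
                                       {"ENT_TYPE": "STATE"}]
                          }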

Code
# Goal: To capture the different types of ADDRESSES in the following ADDRESS SENTENCES
sample_address_sentences = ['My office is located at 1 American rd, Dearborn, MI 48126, United States',
 'My office is located at one American Road Dearborb Michigan 48126 United States',
 'My office is located at 1 American High way, South Dakota 48126, United States',
 'My office is located at 1 American rd, Dearborn, MI, United States',
 'My office is located at 717 N 2ND ST, MANKATO, MN 56001 US',
 'My office is located at 717 N 2ND ST, MANKATO, MN, 56001',
 'My office is located at 717 N 2ND ST MANKATO MN 56001',
 'My office is located at Dearborn Michigan',
 'My office is located at Chennai, TamilNadu',
 'My office is located at Dearborn, Michigan',
 'My office is located at PO Box 107050, Albany, NY 12201-7050',
 'My office is located at PO Box 107050, Albany, NY 12201',
 'My office is located at P.O. Box 107050, Albany, NY 12201-7050',
 'My office is located at P.O. Box 107050, Albany, NY 12201']
Code
# Capture the following first-degree NER entities:


# `DOOR_NUM`
# `STREET_NAME`
# `CITY`
# `STATE`
# `COUNTRY`
# `ZIP_CODE`
# `P_O_BOX`


# one or more of the above first-degree NER entities form the final ADDRESS entity

door_num_spacy_pattern = [{"label":"POTENTIAL_DOOR_NUM",
                           "pattern":[{"LOWER":{"REGEX":"\\b([0-9]{1,4}|one|two|three|four|five|six|seven|eight|nine|ten)\\b"}}]
                         }]

street_spacy_pattern = [{"label":"POTENTIAL_STREET_NAME",
                         "pattern":[{"TEXT":{"REGEX": "^(N|S|E|W)$"},"OP":"?"},
                                    {"TEXT":{"REGEX":"^[A-Z][a-zA-Z]+$|\d(st|nd|rd|th|ST|ND|RD|TH)|[Ff]irst|[Ss]econd|[Tt]hird|[Ff]ourth|[Ff]ifth|[Ss]ixth|[Ss]eventh|[Ee]ighth|[Nn]inth|[Tt]enth"},"OP":"+"},
                                    {"LOWER":{"REGEX":"\\b(street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd|high|park|way|cross)\\b"},"OP":"+"}]
                       }]

city_or_country_spacy_pattern = [{"label":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME",
                                 "pattern":[{"TEXT":{"REGEX":"^[A-Z][a-zA-Z]+$"},"OP":"+", "TAG":{"REGEX":"^NN|HYPH"},}],
                               }]

zip_code_spacy_pattern =  [{"label":"ZIP_CODE",
                            "pattern": [{"TEXT":{"REGEX":"\\b\\d{5}\\b"}},
                                        {"ORTH":"-","OP":"?"},
                                        {"TEXT":{"REGEX":"\\b\\d+\\b"},"OP":"?"} 
                                       ]  
                                }] 

p_o_box_pattern = [{"label":"P_O_BOX",
                     "pattern":[{"LOWER":{"IN":["po","p.o","p.o.","post"]}},
                                {"LOWER":{"IN":["office","."]},"OP":"?"},
                                {"LOWER":{"IN":["box"]}},
                                {"TEXT":{"REGEX":"\\b\\d+\\b"}}
                               ]  
                   }]

first_degree_address_patterns = door_num_spacy_pattern + street_spacy_pattern + city_or_country_spacy_pattern + p_o_box_pattern + zip_code_spacy_pattern

def create_first_degree_address_nlp(first_degree_address_patterns):
    first_degree_address_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {"validate": True,
                    "overwrite_ents": True,
                   }
    first_degree_rules = first_degree_address_nlp.add_pipe("entity_ruler",
                                                           "first_degree_rules",
                                                           config=rules_config)
    first_degree_rules.add_patterns(first_degree_address_patterns)
    return first_degree_address_nlp                                
first_degree_address_nlp = create_first_degree_address_nlp(first_degree_address_patterns)

for i, doc in enumerate(first_degree_address_nlp.pipe(sample_address_sentences,
             as_tuples=False
            )):
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(doc,
                    style='ent',
                   )
Processing Sentence 0 
My office is located at 1 POTENTIAL_DOOR_NUM American rd POTENTIAL_STREET_NAME , Dearborn POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , MI POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 48126 ZIP_CODE , United States POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 1 
My office is located at one POTENTIAL_DOOR_NUM American Road Dearborb Michigan POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 48126 ZIP_CODE United States POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 2 
My office is located at 1 POTENTIAL_DOOR_NUM American High way POTENTIAL_STREET_NAME , South Dakota POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 48126 ZIP_CODE , United States POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 3 
My office is located at 1 POTENTIAL_DOOR_NUM American rd POTENTIAL_STREET_NAME , Dearborn POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , MI POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , United States POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 4 
My office is located at 717 POTENTIAL_DOOR_NUM N 2ND ST POTENTIAL_STREET_NAME , MANKATO POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , MN POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 56001 ZIP_CODE US POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 5 
My office is located at 717 POTENTIAL_DOOR_NUM N 2ND ST POTENTIAL_STREET_NAME , MANKATO POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , MN POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , 56001 ZIP_CODE
Processing Sentence 6 
My office is located at 717 POTENTIAL_DOOR_NUM N 2ND ST POTENTIAL_STREET_NAME MANKATO MN POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 56001 ZIP_CODE
Processing Sentence 7 
My office is located at Dearborn Michigan POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 8 
My office is located at Chennai POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , TamilNadu POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 9 
My office is located at Dearborn POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , Michigan POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME
Processing Sentence 10 
My office is located at PO Box 107050 P_O_BOX , Albany POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , NY POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 12201-7050 ZIP_CODE
Processing Sentence 11 
My office is located at PO Box 107050 P_O_BOX , Albany POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , NY POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 12201 ZIP_CODE
Processing Sentence 12 
My office is located at P.O. Box 107050 P_O_BOX , Albany POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , NY POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 12201-7050 ZIP_CODE
Processing Sentence 13 
My office is located at P.O. Box 107050 P_O_BOX , Albany POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME , NY POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME 12201 ZIP_CODE
# pattern for: 1 American rd, Dearborn, MI 48126, United States
# pattern for: 1 American High way, South Dakota 48126, United States
# pattern for: 1 American rd, Dearborn, MI, United States
# pattern for: 717 N 2ND ST, MANKATO, MN 56001 US
ADDRESS_PATTERN_1 = [{"label":"ADDRESS",
                     "pattern":[{"ENT_TYPE":"POTENTIAL_DOOR_NUM"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"POTENTIAL_STREET_NAME","OP":"+"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"*"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"*"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"ZIP_CODE","OP":"*"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"}, 
                                {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                               ]  
                   },
                   ]

# pattern for: one American Road Dearborb Michigan 48126 United States
ADDRESS_PATTERN_2 = [{"label":"ADDRESS",
                     "pattern":[{"ENT_TYPE":"POTENTIAL_DOOR_NUM"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                {"ENT_TYPE":"ZIP_CODE","OP":"*"},
                                {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"}, 
                                {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                               ]  
                   },
                   ]

# 717 N 2ND ST, MANKATO, MN, 56001
# 717 N 2ND ST MANKATO MN 56001
ADDRESS_PATTERN_3 = [{"label": "ADDRESS",
                      "pattern":[{"ENT_TYPE":"POTENTIAL_DOOR_NUM"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"POTENTIAL_STREET_NAME","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"ZIP_CODE","OP":"+"},
                      ]
                     }
                    ]

# Chennai, TamilNadu
# Dearborn Michigan
ADDRESS_PATTERN_4 = [{"label": "ADDRESS",
                      "pattern":[{"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                      ]
                     }
                    ]

# PO Box 107050, Albany, NY 12201-7050
# PO Box 107050, Albany, NY 12201
# PO Box 107050, Albany, NY, US 12201
ADDRESS_PATTERN_5 = [{"label": "ADDRESS",
                      "pattern":[{"ENT_TYPE":"P_O_BOX","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"+"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"POTENTIAL_CITY_OR_STATE_OR_COUNTRY_NAME","OP":"?"},
                                 {"TEXT":{"REGEX":"\\W{1,2}"},"OP":"?"},
                                 {"ENT_TYPE":"ZIP_CODE","OP":"+"},
                                 
                      ]
                     }
                    ]

second_degree_address_patterns = ADDRESS_PATTERN_1 + ADDRESS_PATTERN_2 + ADDRESS_PATTERN_3 + ADDRESS_PATTERN_4 + ADDRESS_PATTERN_5

def create_second_degree_address_nlp(first_degree_address_patterns,
                                          second_degree_address_patterns
                                         ):
    second_degree_address_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {"validate": True,
                    "overwrite_ents": True,
                   }
    first_degree_rules = second_degree_address_nlp.add_pipe("entity_ruler",
                                                           "first_degree_rules",
                                                           config=rules_config)
    first_degree_rules.add_patterns(first_degree_address_patterns)
    
    second_degree_rules = second_degree_address_nlp.add_pipe("entity_ruler",
                                                           "second_degree_rules",
                                                           config=rules_config,
                                                           after='first_degree_rules')
    second_degree_rules.add_patterns(second_degree_address_patterns)
    return second_degree_address_nlp
second_degree_address_nlp = create_second_degree_address_nlp(first_degree_address_patterns, second_degree_address_patterns)

for i, doc in enumerate(second_degree_address_nlp.pipe(sample_address_sentences,
             as_tuples=False
            )):
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(doc,
                    style='ent',
                   )
Processing Sentence 0 
My office is located at 1 American rd, Dearborn, MI 48126, United States ADDRESS
Processing Sentence 1 
My office is located at one American Road Dearborb Michigan 48126 United States ADDRESS
Processing Sentence 2 
My office is located at 1 American High way, South Dakota 48126, United States ADDRESS
Processing Sentence 3 
My office is located at 1 American rd, Dearborn, MI, United States ADDRESS
Processing Sentence 4 
My office is located at 717 N 2ND ST, MANKATO, MN 56001 US ADDRESS
Processing Sentence 5 
My office is located at 717 N 2ND ST, MANKATO, MN ADDRESS , 56001 ZIP_CODE
Processing Sentence 6 
My office is located at 717 N 2ND ST MANKATO MN 56001 ADDRESS
Processing Sentence 7 
My office is located at Dearborn Michigan ADDRESS
Processing Sentence 8 
My office is located at Chennai, ADDRESS TamilNadu ADDRESS
Processing Sentence 9 
My office is located at Dearborn, ADDRESS Michigan ADDRESS
Processing Sentence 10 
My office is located at PO Box 107050, Albany, NY 12201-7050 ADDRESS
Processing Sentence 11 
My office is located at PO Box 107050, Albany, NY 12201 ADDRESS
Processing Sentence 12 
My office is located at P.O. Box 107050, Albany, NY 12201-7050 ADDRESS
Processing Sentence 13 
My office is located at P.O. Box 107050, Albany, NY 12201 ADDRESS

4. How to Save a Rules NER Model as a Package

  • Save the rules NER model phone_nlp to a physical location using nlp.to_disk
  • Save the custom components in a .py file: spacy_rules_ner/phone_functions.py
  • Use python -m spacy package input_dir output_dir --code location/to/custom_components.py --name new_model_name to generate a package in .tar.gz format
  • Pip-install the .tar.gz file using pip install location/to/tar.gz
  • spacy.load('new_model_name') will then load your package along with its custom components
!mkdir -p spacy_rules_ner/phone_nlp
Code
# Let us save the `phone_nlp` pipeline, including the custom component

phone_nlp.to_disk('spacy_rules_ner/phone_nlp')
!ls spacy_rules_ner/phone_nlp
attribute_ruler  lemmatizer  ner     phone_nlp_rules  tagger   tokenizer
config.cfg   meta.json   parser  senter       tok2vec  vocab
%%writefile spacy_rules_ner/phone_functions.py

import json
import spacy
import re 
from spacy import Language
from spacy.tokens import Span
from spacy.util import filter_spans

def load_entity_ruler_based_phone_pattern(location_spacy_json):
    loaded_spacy_patterns = json.load(open(location_spacy_json,'r',encoding='utf-8'))
    phone_nlp = spacy.load('en_core_web_sm', disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }

    phone_nlp_rules = phone_nlp.add_pipe("entity_ruler", # invoke entity_ruler pipe 
                                         "phone_nlp_rules", # give a name to the pipe
                                         config=rules_config)
    phone_nlp_rules.add_patterns(loaded_spacy_patterns) # load patterns to the `phone_nlp_rules` pipe of `phone_nlp` model
    return phone_nlp
       
location_spacy_json = 'spacy_rules_ner/phone_patterns.json'

phone_nlp = load_entity_ruler_based_phone_pattern(location_spacy_json)

print("Pipeline Components before adding regex custom component:")
print(phone_nlp.pipe_names)
print()
print("Entities tracked in phone_nlp_rules")
print(phone_nlp.pipe_labels['phone_nlp_rules'])

phone_regex_pattern = r"([+]?[\d]?[\d]?.?[(]?\d{3}[)]?.?\d{3}.?\d{4})"

# map every character offset in a doc to the index of the token containing it

def generate_chars2tokens_dict(doc):
    chars_to_tokens = {}
    for token in doc:
        for i in range(token.idx, token.idx + len(token.text)):
            chars_to_tokens[i] = token.i
    return chars_to_tokens


@Language.component("phone_multitoken_regex_capture")
def phone_multitoken_regex_capture(doc):
    original_ents = list(doc.ents)
    chars_to_tokens = generate_chars2tokens_dict(doc)
    phone_regex_ents = []
    for match in re.finditer(phone_regex_pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            phone_regex_ents.append((span.start, span.end, span.text))
        else:
            start_token = chars_to_tokens.get(start)
            end_token = chars_to_tokens.get(end)
            if start_token is not None and end_token is not None:
                span = doc[start_token:end_token + 1]
                phone_regex_ents.append((span.start, span.end, span.text))
    for regex_ent in phone_regex_ents:
        start_token, end_token, span_text = regex_ent  # token indices, not character offsets
        proper_spacy_ent = Span(doc, start_token, end_token, label="PHONE")
        original_ents.append(proper_spacy_ent)     
    filtered = filter_spans(original_ents) #removes overlapping ents
    doc.ents = filtered
    return doc


phone_nlp.add_pipe("phone_multitoken_regex_capture", after="phone_nlp_rules")

print("Pipeline Components after adding regex custom component:")
print(phone_nlp.pipe_names)

# inspiration for the above code piece: 
# https://spacy.io/usage/rule-based-matching#regex-text
Overwriting spacy_rules_ner/phone_functions.py
# create the output_dir
!mkdir -p ./spacy_rules_ner/packaged_phone_nlp
# now let us package the `phone_nlp`

!python3 -m spacy package ./spacy_rules_ner/phone_nlp ./spacy_rules_ner/packaged_phone_nlp --code spacy_rules_ner/phone_functions.py --name phone_nlp_2
ℹ Building package artifacts: sdist
Pipeline Components before adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules']

Entities tracked in phone_nlp_rules
['PHONE']
Pipeline Components after adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules', 'phone_multitoken_regex_capture']
✔ Including 1 Python module(s) with custom code
✔ Loaded meta.json from file
spacy_rules_ner/phone_nlp/meta.json
✔ Generated README.md from meta.json
✔ Successfully created package 'en_phone_nlp_2-3.1.0'
spacy_rules_ner/packaged_phone_nlp/en_phone_nlp_2-3.1.0
running sdist
running egg_info
creating en_phone_nlp_2.egg-info
writing en_phone_nlp_2.egg-info/PKG-INFO
writing dependency_links to en_phone_nlp_2.egg-info/dependency_links.txt
writing entry points to en_phone_nlp_2.egg-info/entry_points.txt
writing requirements to en_phone_nlp_2.egg-info/requires.txt
writing top-level names to en_phone_nlp_2.egg-info/top_level.txt
writing manifest file 'en_phone_nlp_2.egg-info/SOURCES.txt'
reading manifest file 'en_phone_nlp_2.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'LICENSE'
warning: no files found matching 'LICENSES_SOURCES'
writing manifest file 'en_phone_nlp_2.egg-info/SOURCES.txt'
running check
creating en_phone_nlp_2-3.1.0
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/attribute_ruler
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/lemmatizer
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/lemmatizer/lookups
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/phone_nlp_rules
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/senter
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tagger
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tok2vec
creating en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab
copying files to en_phone_nlp_2-3.1.0...
copying MANIFEST.in -> en_phone_nlp_2-3.1.0
copying README.md -> en_phone_nlp_2-3.1.0
copying meta.json -> en_phone_nlp_2-3.1.0
copying setup.py -> en_phone_nlp_2-3.1.0
copying en_phone_nlp_2/__init__.py -> en_phone_nlp_2-3.1.0/en_phone_nlp_2
copying en_phone_nlp_2/meta.json -> en_phone_nlp_2-3.1.0/en_phone_nlp_2
copying en_phone_nlp_2/phone_functions.py -> en_phone_nlp_2-3.1.0/en_phone_nlp_2
copying en_phone_nlp_2.egg-info/PKG-INFO -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/SOURCES.txt -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/dependency_links.txt -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/entry_points.txt -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/not-zip-safe -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/requires.txt -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2.egg-info/top_level.txt -> en_phone_nlp_2-3.1.0/en_phone_nlp_2.egg-info
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/README.md -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/config.cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/meta.json -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/tokenizer -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/attribute_ruler/patterns -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/attribute_ruler
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/lemmatizer/lookups/lookups.bin -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/lemmatizer/lookups
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner/model -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner/moves -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/ner
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser/model -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser/moves -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/parser
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/phone_nlp_rules/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/phone_nlp_rules
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/phone_nlp_rules/patterns.jsonl -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/phone_nlp_rules
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/senter/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/senter
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/senter/model -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/senter
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/tagger/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tagger
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/tagger/model -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tagger
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/tok2vec/cfg -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tok2vec
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/tok2vec/model -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/tok2vec
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab/key2row -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab/lookups.bin -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab/strings.json -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab
copying en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab/vectors -> en_phone_nlp_2-3.1.0/en_phone_nlp_2/en_phone_nlp_2-3.1.0/vocab
Writing en_phone_nlp_2-3.1.0/setup.cfg
creating dist
Creating tar archive
removing 'en_phone_nlp_2-3.1.0' (and everything under it)
✔ Successfully created zipped Python package
spacy_rules_ner/packaged_phone_nlp/en_phone_nlp_2-3.1.0/dist/en_phone_nlp_2-3.1.0.tar.gz

The generated .tar.gz file can be shared and pip-installed:

!pip install spacy_rules_ner/packaged_phone_nlp/en_phone_nlp_2-3.1.0/dist/en_phone_nlp_2-3.1.0.tar.gz
Defaulting to user installation because normal site-packages is not writeable
Processing ./spacy_rules_ner/packaged_phone_nlp/en_phone_nlp_2-3.1.0/dist/en_phone_nlp_2-3.1.0.tar.gz
  Preparing metadata (setup.py) ... done
Requirement already satisfied: spacy<3.2.0,>=3.1.3 in /usr/local/lib/python3.7/dist-packages (from en-phone-nlp-2==3.1.0) (3.1.3)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.26.0)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (3.0.5)
Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (0.6.0)
Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (0.4.0)
Requirement already satisfied: srsly<3.0.0,>=2.4.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.4.1)
Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.0.6)
Requirement already satisfied: typing-extensions<4.0.0.0,>=3.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (3.10.0.2)
Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (1.8.2)
Requirement already satisfied: thinc<8.1.0,>=8.0.9 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (8.0.10)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (4.62.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (3.0.1)
Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (0.7.4)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (59.1.1)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (21.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.0.5)
Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.8 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (3.0.8)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (1.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.8.1 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (0.8.2)
Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.7/dist-packages (from spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (1.21.2)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from catalogue<2.1.0,>=2.0.6->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (3.5.0)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.4.7)
Requirement already satisfied: smart-open<6.0.0,>=5.0.0 in /usr/local/lib/python3.7/dist-packages (from pathy>=0.3.5->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (5.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2019.11.28)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.7/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.0.7)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.8)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (1.25.8)
Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.7/dist-packages (from typer<0.5.0,>=0.3.0->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (7.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.7/dist-packages (from jinja2->spacy<3.2.0,>=3.1.3->en-phone-nlp-2==3.1.0) (2.0.1)
Building wheels for collected packages: en-phone-nlp-2
  Building wheel for en-phone-nlp-2 (setup.py) ... done
  Created wheel for en-phone-nlp-2: filename=en_phone_nlp_2-3.1.0-py3-none-any.whl size=13618384 sha256=86d51e0f37b26c932184654dd4b82ad7475f0cf41d1e32499263b505a4ea151a
  Stored in directory: /path/to/dir/.cache/pip/wheels/04/04/8d/81eaf26a25f7dfa433e1be3ad7e6524e81e9a4172a3d9d0d06
Successfully built en-phone-nlp-2
Installing collected packages: en-phone-nlp-2
Successfully installed en-phone-nlp-2-3.1.0
WARNING: You are using pip version 21.3.1; however, version 22.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
phone_nlp_new = spacy.load('en_phone_nlp_2')
Pipeline Components before adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules']

Entities tracked in phone_nlp_rules
['PHONE']
Pipeline Components after adding regex custom component:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'phone_nlp_rules', 'phone_multitoken_regex_capture']
for i, doc in enumerate(phone_nlp_new.pipe(sample_phone_sentences,
             as_tuples=False
            )):
    print(f"\033[1mProcessing Sentence {i} \033[1m")
    displacy.render(doc,
                    style='ent',
                   )
Processing Sentence 0 
If you want to talk more. Reach me at +1 (901)-985-4567 PHONE
Processing Sentence 1 
If you want to talk more. Reach me at +1(901)-985-4567 PHONE
Processing Sentence 2 
If you want to talk more. Reach me at (901) 985 4567 PHONE
Processing Sentence 3 
If you want to talk more. Reach me at 9019854567 PHONE

5. Conclusion

  • By building rules-based NER models for the MODEL_NAMES, PHONE, and ADDRESS entities, we discussed the following concepts:
    • spaCy’s Token Matcher, Phrase Matcher, and our own custom-component regex matcher
    • How to load spaCy patterns from a JSON file
    • How to chain NER entities
    • How to save and load an NER pipeline with a custom component

References

  • spaCy Rule-Based Matching: https://spacy.io/usage/rule-based-matching