How to train a Spacy NER Model

In this blog post, I cover the process of creating a trained ML NER model from unlabeled data
NLP
DL
Coding
Python
Author

Senthil Kumar

Published

June 25, 2021


1. Introduction

TL;DR Summary of the Blog

How do we create an ML NER model when we have no labeled data?

  • Prepare rules-bootstrapped training data from the unlabeled corpus
    • If it is possible/easy to annotate directly, one can do that.
    • However, if rules tagging is possible
      • as in the Disease NER dataset example here,
      • there is an opportunity to use a huge list of words to tag via rules first,
      • and then labeling becomes easier than labeling from scratch
  • The rules-bootstrapped data is then reviewed/edited by human annotators
  • Stratify-split the human-reviewed data into train-dev-test at the sentence level
  • Optimize and train one or more Spacy ML NER models
  • Compare and evaluate the accuracy of the models

Now, we can list the above steps with a DISEASE NER example …

  • We have an ncbi_disease dataset of 7295 sentences speaking of various entities, of which the disease entity is our focus.
    > E.g.: “Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor .”
    > adenomatous polyposis coli tumour is a DISEASE entity

  • Source of this dataset: link

  • For the sake of this blog's argument, we assume this dataset has no labels.

  • In most real-world settings, we are unlikely to have labeled data.


  • Hence the pipeline below helps in building an ML model:
    > Unlabeled sentences speaking of various diseases
    > Tag DISEASE NER via rules using a huge list of disease words
    > Review/edit the rules-bootstrapped NER tags (tagging NER from scratch is a lot tougher)
    > Split the data into train-dev-test
    > Train an ML model on the train and dev datasets and evaluate on the unseen test dataset
    > Evaluate & compare the Rules model (baseline) and the Spacy ML NER models (built from spacy-small and roberta-base)

2. Prepare Rules-bootstrapped Data from unlabeled data

2A. Loading the disease words from an external file

Code
import json
import re

with open('./spacy_model_ner/diseases_ner.json','r') as f:
    diseases_json = json.load(f)
    
list_of_diseases = [each['disease'] for each in diseases_json if not re.search('[,]|test',each['disease'],re.I)]

list_of_diseases.extend(['tumor','tumour']) # adding some custom words

list_of_diseases[0:10]
['Hemophilia',
 'Hemophilia A',
 'Hepatitis A',
 'Abdominal Aortic Aneurysm',
 'AAA',
 'Alpha 1 Antitrypsin Deficiency',
 'AAT',
 'AATD',
 'Scar Tissue',
 'Abdominal Adhesions']

2B. Convert Disease Words into Spacy Patterns

Code
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner']) # the pretrained ner component is not needed

def list_of_words_2_spacy_patterns(list_of_words,
                                   nlp_model,
                                   label_name
                                  ):
    spacy_patterns = []
    for each_word in list_of_words:
        sub_pattern_list = []
        for token in nlp_model(each_word.lower()):
            if re.search(r'^\W+$', token.text):
                # punctuation-only tokens are made optional in the pattern
                sub_pattern_list.append({"ORTH": token.text, "OP": "*"})
            else:
                sub_pattern_list.append({"LOWER": token.text})
        temp_dict = {"label": label_name,
                     "pattern": sub_pattern_list}
        spacy_patterns.append(temp_dict)
    return spacy_patterns

disease_spacy_rules_patterns = list_of_words_2_spacy_patterns(list_of_diseases,
                               nlp,
                               "DISEASE"
                              )

disease_spacy_rules_patterns[0:5]
[{'label': 'DISEASE', 'pattern': [{'LOWER': 'hemophilia'}]},
 {'label': 'DISEASE', 'pattern': [{'LOWER': 'hemophilia'}, {'LOWER': 'a'}]},
 {'label': 'DISEASE', 'pattern': [{'LOWER': 'hepatitis'}, {'LOWER': 'a'}]},
 {'label': 'DISEASE',
  'pattern': [{'LOWER': 'abdominal'},
   {'LOWER': 'aortic'},
   {'LOWER': 'aneurysm'}]},
 {'label': 'DISEASE', 'pattern': [{'LOWER': 'aaa'}]}]

2C. Create Disease NER out of spacy patterns

Code
def load_rules_nlp_model_from_spacy_patterns(spacy_patterns):
    rules_nlp = spacy.load('en_core_web_sm',disable=['ner'])
    rules_config = {
        "validate": True,
        "overwrite_ents": True,
    }

    disease_rules = rules_nlp.add_pipe("entity_ruler", # invoke entity_ruler pipe 
                                       "disease_rules", # give a name to the pipe
                                       config=rules_config)
    disease_rules.add_patterns(spacy_patterns)
    return rules_nlp

disease_ner_rules_nlp = load_rules_nlp_model_from_spacy_patterns(disease_spacy_rules_patterns)

print("The pipeline components are:")
print(disease_ner_rules_nlp.pipe_names)
print("NER entities tracked are:")
print(disease_ner_rules_nlp.pipe_labels['disease_rules'])
The pipeline components are:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'disease_rules']
NER entities tracked are:
['DISEASE']
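With the entity ruler in place, tagging a sentence is a one-line call. A minimal self-contained sketch — note it uses a blank English pipeline instead of en_core_web_sm so it runs without the pretrained model, and hard-codes one of the patterns generated above:

```python
import spacy

# Blank pipeline: tokenizer only, so no pretrained model download is needed
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "DISEASE", "pattern": [{"LOWER": "adenomatous"},
                                     {"LOWER": "polyposis"},
                                     {"LOWER": "coli"}]},
])

doc = nlp("Identification of APC2 , a homologue of the "
          "adenomatous polyposis coli tumour suppressor .")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('adenomatous polyposis coli', 'DISEASE')]
```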

2D. Run Spacy Rules NER inference to get token-level results

Code
import pandas as pd

## The token-level results from the above model on 7.2K sentences look like below
token_level_rules_output_example = pd.read_csv('spacy_model_ner/token_level_tags_on_one_unlabeled_sentence.csv',index_col=False)
token_level_rules_output_example = token_level_rules_output_example[['New_Sentence_Id','Token','Rules_Tag_BIO']]
token_level_rules_output_example.columns = ['Random_Sentence_Id','Token','Rules_Tag_BIO']
Code
token_level_rules_output_example
Random_Sentence_Id Token Rules_Tag_BIO
0 tr_0 Identification O
1 tr_0 of O
2 tr_0 APC2 O
3 tr_0 , O
4 tr_0 a O
5 tr_0 homologue O
6 tr_0 of O
7 tr_0 the O
8 tr_0 adenomatous B-DISEASE
9 tr_0 polyposis I-DISEASE
10 tr_0 coli I-DISEASE
11 tr_0 tumour O
12 tr_0 suppressor O
13 tr_0 . O
14 tr_0 [SEP] [SEP]
  • Of course, there are mistakes in this rules NER output, like row #11 where tumour is not tagged I-DISEASE
  • We rectify the mistakes of the rules via human annotation
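The BIO column shown above is just an expansion of the entity spans the ruler finds. A stdlib sketch of that span-to-BIO mapping (spans_to_bio is a hypothetical helper, not from the original notebook; spans are half-open token-index intervals):

```python
def spans_to_bio(n_tokens, spans, label="DISEASE"):
    """Expand half-open (start, end) token-index spans into BIO tags.

    Hypothetical helper, not from the original notebook.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

# Tokens 8-11 of the example sentence cover the corrected disease mention
# "adenomatous polyposis coli tumour" (span (8, 12), end-exclusive).
tags = spans_to_bio(14, [(8, 12)])
print(tags[8:12])  # ['B-DISEASE', 'I-DISEASE', 'I-DISEASE', 'I-DISEASE']
```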

3. Human Review of the Rules Output

3A. Edit token-level results in csv by human annotation/review

Code
## The token-level results after human review look like below
token_level_rules_plus_human_output_example = pd.read_csv('spacy_model_ner/token_level_tags_on_one_unlabeled_sentence_2.csv',index_col=False)
token_level_rules_plus_human_output_example
New_Sentence_Id Token Rules_Tag_BIO Human_Annotated_Tag_BIO
0 tr_0 Identification O O
1 tr_0 of O O
2 tr_0 APC2 O O
3 tr_0 , O O
4 tr_0 a O O
5 tr_0 homologue O O
6 tr_0 of O O
7 tr_0 the O O
8 tr_0 adenomatous B-DISEASE B-DISEASE
9 tr_0 polyposis I-DISEASE I-DISEASE
10 tr_0 coli I-DISEASE I-DISEASE
11 tr_0 tumour O I-DISEASE
12 tr_0 suppressor O O
13 tr_0 . O O
14 tr_0 [SEP] [SEP] [SEP]
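A quick way to gauge how much labeling effort the rules bootstrap saved is to measure how often the reviewer kept the rules tag. A stdlib sketch (rules_vs_human_agreement is a hypothetical helper; the two arguments correspond to the Rules_Tag_BIO and Human_Annotated_Tag_BIO columns above):

```python
def rules_vs_human_agreement(rules_tags, human_tags):
    """Fraction of (non-[SEP]) tokens where the reviewer kept the rules tag.

    A rough measure of how much work the rules bootstrap saved.
    Hypothetical helper, not from the original notebook.
    """
    pairs = [(r, h) for r, h in zip(rules_tags, human_tags) if r != "[SEP]"]
    kept = sum(r == h for r, h in pairs)
    return kept / len(pairs)

rules = ["O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O", "[SEP]"]
human = ["O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE", "O", "[SEP]"]
print(rules_vs_human_agreement(rules, human))  # 5 of 6 tags kept
```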

4. Stratify Split the Data based on human annotations

Code
sentence_level_count_df = pd.read_csv('spacy_model_ner/split_of_classes_sentence_level.csv',index_col=False)
print(f"Total number of Sentences: {sum(sentence_level_count_df['Count'])}")
sentence_level_count_df
Total number of Sentences: 7295
Sentence_level_tags Count %_contribution
0 O 3337 46.0
1 B-DISEASE|I-DISEASE|O 2698 37.0
2 B-DISEASE|O 1260 17.0

From the above table, we can infer that
  • there are more multi-token diseases than single-token diseases
  • there are 3.3K sentences with only O tokens

We have to ensure all three splits - train, dev and test - have the same percentage of O, B-DISEASE|I-DISEASE|O and B-DISEASE|O

After splitting into train-dev-test in a 80-10-10 split …
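The split can be done per sentence-level tag signature so each class keeps its share in every split. A stdlib sketch (stratified_split is a hypothetical helper; the blog's actual split code is not shown, and class sizes are taken from the table above):

```python
import random

def stratified_split(ids_by_class, seed=42):
    """80-10-10 split preserving each class's share in every split.

    `ids_by_class` maps a sentence-level tag signature (e.g. "B-DISEASE|O")
    to the sentence ids carrying it. Hypothetical helper.
    """
    random.seed(seed)
    train, dev, test = [], [], []
    for ids in ids_by_class.values():
        ids = list(ids)
        random.shuffle(ids)
        n_train = int(len(ids) * 0.8)
        n_dev = int(len(ids) * 0.1)
        train += ids[:n_train]
        dev += ids[n_train:n_train + n_dev]
        test += ids[n_train + n_dev:]
    return train, dev, test

# Class sizes from the sentence-level table above (7295 sentences total)
train, dev, test = stratified_split({
    "O": range(0, 3337),
    "B-DISEASE|I-DISEASE|O": range(3337, 6035),
    "B-DISEASE|O": range(6035, 7295),
})
print(len(train), len(dev), len(test))
```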

Code
train_count_df = pd.read_csv('spacy_model_ner/split_of_classes_sentence_level_train.csv',index_col=False)
print(f"Total number of Train Sentences: {sum(train_count_df['Count'])}")
train_count_df
Total number of Train Sentences: 5836
Sentence_level_tags Count %_contribution
0 O 2670 46.0
1 B-DISEASE|I-DISEASE|O 2158 37.0
2 B-DISEASE|O 1008 17.0
Code
dev_count_df = pd.read_csv('spacy_model_ner/split_of_classes_sentence_level_dev.csv',index_col=False)
print(f"Total number of Dev Sentences: {sum(dev_count_df['Count'])}")
dev_count_df
Total number of Dev Sentences: 729
Sentence_level_tags Count %_contribution
0 O 333 46.0
1 B-DISEASE|I-DISEASE|O 270 37.0
2 B-DISEASE|O 126 17.0
Code
test_count_df = pd.read_csv('spacy_model_ner/split_of_classes_sentence_level_test.csv',index_col=False)
print(f"Total number of Test Sentences: {sum(test_count_df['Count'])}")
test_count_df
Total number of Test Sentences: 730
Sentence_level_tags Count %_contribution
0 O 334 46.0
1 B-DISEASE|I-DISEASE|O 270 37.0
2 B-DISEASE|O 126 17.0

5. Train ML Model

5A. Convert token-level results to Spacy-acceptable CoNLL data

Code
def convert_token_df_2_conll_string(token_tag_df,
                                    token_column_name,
                                    tag_column_name
                                   ):
    lines = []
    for each in range(len(token_tag_df)):
        if each % 1000 == 0:
            print(f"{each} tokens processed")
        current_token_string = str(token_tag_df.loc[each, token_column_name])
        current_tag_string = str(token_tag_df.loc[each, tag_column_name])

        if current_token_string != '[SEP]':
            # one "token<TAB>tag" line per token
            lines.append(current_token_string + "\t" + current_tag_string + "\n")
        else:
            # [SEP] rows become the blank line that separates sentences
            lines.append("\n")
    return ''.join(lines)
!tail -n 25 ../data/diease_ner/train_dev_test_split_conll_data/test_data.conll
investigate O
the O
rate    O
of  O
BRCA2   O
mutation    O
in  O
sporadic    B-DISEASE
breast  I-DISEASE
cancers I-DISEASE
and O
in  O
a   O
set O
of  O
cell    O
lines   O
that    O
represent   O
twelve  O
other   O
tumour  B-DISEASE
types   O
.   O
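The .conll files are then converted to spaCy's binary format (e.g. with `python -m spacy convert <file>.conll <out_dir> --converter ner`) before training. Reading the tab-separated format back is simple; a stdlib sketch (read_conll is a hypothetical helper mirroring convert_token_df_2_conll_string above):

```python
def read_conll(conll_string):
    """Parse token<TAB>tag lines back into per-sentence (tokens, tags) pairs.

    Blank lines separate sentences. Hypothetical helper, not from the
    original notebook.
    """
    sentences, tokens, tags = [], [], []
    for line in conll_string.splitlines():
        if not line.strip():
            if tokens:  # close the current sentence at a blank line
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush a trailing sentence with no final blank line
        sentences.append((tokens, tags))
    return sentences

sample = "tumour\tB-DISEASE\ntypes\tO\n.\tO\n\nBRCA2\tO\n"
print(read_conll(sample))
# [(['tumour', 'types', '.'], ['B-DISEASE', 'O', 'O']), (['BRCA2'], ['O'])]
```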

5B. Train a Spacy Small ML Model
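Training is driven entirely by a config file. The blog's original_spacy_small_ner_config.cfg is not shown; such configs are typically generated with `python -m spacy init config <file>.cfg --lang en --pipeline ner` and then edited. An illustrative excerpt, not the actual file (section names follow spaCy's config system; exact defaults vary by version):

```ini
[paths]
# both overridden on the CLI below with --paths.train / --paths.dev
train = null
dev = null

[nlp]
lang = "en"
pipeline = ["tok2vec","ner"]

[training]
# 0 means train until patience/max_steps triggers early stopping
max_epochs = 0
patience = 1600
# evaluate on dev every 200 steps (matches the 200-step rows in the log)
eval_frequency = 200
```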

CLI command for Spacy Model Training:

!python3 -m spacy train $CONFIG_DIR/original_spacy_small_ner_config.cfg \
--output $SPACY_SMALL_MODEL_DIR_GPU \
--paths.train $SPACY_DATA_DIR/train_data.spacy \
--paths.dev $SPACY_DATA_DIR/dev_data.spacy \
--verbose \
-g 0

The output from the Spacy Model Training:

[2022-05-24 07:09:39,410] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory:
../data/model_weights/spacy_small
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-05-24 07:09:41,789] [INFO] Set up nlp object from config
[2022-05-24 07:09:41,797] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/dev_data.spacy
[2022-05-24 07:09:41,798] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/train_data.spacy
[2022-05-24 07:09:41,798] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-05-24 07:09:41,803] [INFO] Created vocabulary
[2022-05-24 07:09:41,804] [INFO] Finished initializing nlp object

[2022-05-24 07:09:51,839] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
[2022-05-24 07:09:51,852] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/dev_data.spacy
[2022-05-24 07:09:51,853] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/train_data.spacy
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     41.00    0.31    0.33    0.29    0.00
  0     200        195.83   1820.96   41.44   55.56   33.05    0.41             
  0     400        106.20   1131.87   47.27   66.58   36.64    0.47             
  0     600         64.99    969.69   74.94   77.17   72.84    0.75             
  0     800         87.66   1096.80   74.39   76.96   71.98    0.74             
  1    1000         82.51   1134.93   77.53   79.28   75.86    0.78             
  1    1200        113.61   1122.73   80.73   82.36   79.17    0.81             
  1    1400        128.84   1178.08   84.40   86.10   82.76    0.84             
...   
 26    5400        287.07    591.02   85.96   87.04   84.91    0.86             
 27    5600        382.16    540.78   86.21   87.56   84.91    0.86             
 28    5800        406.03    615.57   86.29   86.67   85.92    0.86             
✔ Saved pipeline to output directory
../data/model_weights/spacy_small/model-last

Command for Evaluating the Model Results:

!python3 -m spacy evaluate $SPACY_SMALL_MODEL_DIR_GPU/model-best $SPACY_DATA_DIR/test_data.spacy \
--output $SPACY_SMALL_MODEL_DIR_GPU/model-best/spacy_small_model_evaluation.json \
--gpu-id 0

Output of the Evaluate Command:

ℹ Using GPU: 0

================================== Results ==================================

TOK     -    
NER P   89.75
NER R   82.81
NER F   86.14
SPEED   20752


=============================== NER (per type) ===============================

              P       R       F
DISEASE   89.75   82.81   86.14

✔ Saved results to
../data/model_weights/spacy_small/model-best/spacy_small_model_evaluation.json

5C. Train a Spacy Roberta Base ML Model

CLI command for Spacy Model Training:

!python3 -m spacy train $CONFIG_DIR/original_trf_config.cfg \
--output $SPACY_ROBERTA_MODEL_DIR_GPU \
--paths.train $SPACY_DATA_DIR/train_data.spacy \
--paths.dev $SPACY_DATA_DIR/dev_data.spacy \
--verbose \
-g 0

The output from the Spacy Model Training:


[2022-05-24 07:44:08,351] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
✔ Created output directory:
../data/model_weights/spacy_roberta_base_
ℹ Saving to output directory:
../data/model_weights/spacy_roberta_base_
ℹ Using GPU: 0

=========================== Initializing pipeline ===========================
[2022-05-24 07:44:11,169] [INFO] Set up nlp object from config
[2022-05-24 07:44:11,178] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/dev_data.spacy
[2022-05-24 07:44:11,180] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/train_data.spacy
[2022-05-24 07:44:11,180] [INFO] Pipeline: ['transformer', 'ner']
[2022-05-24 07:44:11,184] [INFO] Created vocabulary
[2022-05-24 07:44:11,185] [INFO] Finished initializing nlp object

[2022-05-24 07:44:22,286] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
[2022-05-24 07:44:22,298] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/dev_data.spacy
[2022-05-24 07:44:22,299] [DEBUG] Loading corpus from path: ../data/diease_ner/train_dev_test_split_spacy_binary/train_data.spacy
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0        4392.55    285.04    0.21    0.17    0.29    0.00
  1     200      126102.97  33471.51   84.94   81.78   88.36    0.85
  3     400        1782.34   2642.55   89.50   90.83   88.22    0.90
  5     600        1123.23   1596.50   90.69   89.06   92.39    0.91
...
 27    2800         153.06    176.41   90.94   90.35   91.52    0.91
 29    3000         103.80    128.69   90.86   88.98   92.82    0.91
 30    3200         121.04    141.70   91.47   90.56   92.39    0.91
 32    3400          90.05    116.95   90.51   89.93   91.09    0.91
 34    3600         111.25    131.25   91.12   90.15   92.10    0.91
 36    3800          79.87     82.69   90.30   89.66   90.95    0.90
 38    4000          82.07     82.97   90.97   90.01   91.95    0.91
✔ Saved pipeline to output directory
../data/model_weights/spacy_roberta_base_/model-last
CPU times: user 11.1 s, sys: 2.78 s, total: 13.9 s
Wall time: 22min 37s

Command for Evaluating the Model Results:

!python3 -m spacy evaluate $SPACY_ROBERTA_MODEL_DIR_GPU/model-best $SPACY_DATA_DIR/test_data.spacy \
--output $SPACY_ROBERTA_MODEL_DIR_GPU/model-best/spacy_roberta_base_evaluation.json \
--gpu-id 0

Output of the Evaluate Command:

ℹ Using GPU: 0

================================== Results ==================================

TOK     -    
NER P   88.61
NER R   90.26
NER F   89.43
SPEED   12020


=============================== NER (per type) ===============================

              P       R       F
DISEASE   88.61   90.26   89.43

✔ Saved results to
../data/model_weights/spacy_roberta_base_/model-best/spacy_roberta_base_evaluation.json

6. Evaluate the models

6A. Comparing the entity-level F1-score of (1) Rules, (2) Spacy small and (3) Spacy Roberta-base model

results = pd.read_csv('spacy_model_ner/results_of_the_models.csv',index_col=0)
results
Precision Recall F1_Score
Rules_Model 39.85 22.21 28.52
Spacy_Small_Model 89.75 82.81 86.14
Spacy_Roberta_Base_Model 88.61 90.26 89.43
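As a sanity check, the F1 values in the table are simply the harmonic mean of the reported precision and recall:

```python
def f1(precision, recall):
    # Entity-level F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(round(f1(89.75, 82.81), 2))  # 86.14 (Spacy_Small_Model)
print(round(f1(88.61, 90.26), 2))  # 89.43 (Spacy_Roberta_Base_Model)
```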

7. Conclusion

In this blog article, we have shown how to effectively build an NER model from unlabeled data.
  • We compared the Rules NER, Spacy Small NER and roberta-base NER models.
  • We found the roberta-base model has the highest F1 score of 89%.
  • We can also ensemble the results of the Spacy Small NER and Spacy Roberta Base NER models.
  • A digression from the scope of this article: there are umpteen good tools (mostly paid) that aid annotation. Sometimes, for a simple NER problem (like tagging only one entity, as in this Disease NER), even Excel is good enough for annotation.

References:
  • https://spacy.io/api/cli#evaluate
  • If you would like to replicate the above results, refer to the DiseaseNER notebooks in this link