In this blog post, I cover the process of creating a trained ML NER model from unlabeled data
NLP
DL
Coding
Python
Author
Senthil Kumar
Published
June 25, 2021
1. Introduction
TL;DR Summary of the Blog
How do we create an ML NER model from no labeled data?

- Prepare rules-bootstrapped training data from the unlabeled corpus
  - If it is possible/easy to annotate directly, one can do that
  - However, if rules tagging is possible, it is a much better starting point
  - In the Disease NER dataset example here, there is an opportunity to use a huge list of disease words to tag via rules first; labeling then becomes easier than labeling from scratch
- The rules-bootstrapped data is then reviewed/edited by human annotators
- Stratify-split the human-reviewed data into train-dev-test at the sentence level
- Optimize and train one or more spaCy ML NER models
- Compare and evaluate the accuracy of the models
Now, we can list the above steps with a DISEASE NER example …
We have the ncbi_disease dataset of 7295 sentences speaking of various entities, of which the disease entity is our focus.

> E.g.: "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
> Here, adenomatous polyposis coli tumour is a DISEASE entity
For the sake of the argument of this blog, we assume this dataset does not have labels.
In most real-world datasets, we are not likely to have labeled data.
Hence the below pipeline helps in building an ML model:

1. Unlabeled sentences speaking of various diseases
2. Tag DISEASE entities via rules, using a huge list of disease words
3. Review/edit the rules-bootstrapped NER tags (tagging NER from scratch is a lot tougher)
4. Split the data into train-dev-test
5. Train an ML model on the train and dev datasets and evaluate on the unseen test dataset
6. Evaluate & compare the Rules model (baseline) and the spaCy ML NER models (built from spacy-small and roberta-base)
2. Prepare Rules-bootstrapped Data from unlabeled data
2A. Loading the disease words from an external file
Code
```python
import json
import re

with open('./spacy_model_ner/diseases_ner.json', 'r') as f:
    diseases_json = json.load(f)

list_of_diseases = [each['disease'] for each in diseases_json
                    if not re.search('[,]|test', each['disease'], re.I)]
list_of_diseases.extend(['tumor', 'tumour'])  # adding some custom words
list_of_diseases[0:10]
```
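The tagging itself can be sketched in plain Python (a simplified stand-in for the actual rules model, whose full code is not shown here): scan each sentence for the longest matching disease phrase and emit B-/I-DISEASE tags for the matched tokens, O elsewhere.

```python
def rules_bio_tag(tokens, disease_phrases):
    """Tag tokens with BIO labels by longest-match lookup in a phrase list."""
    # Pre-tokenize the phrases; try longer phrases before shorter ones
    phrases = sorted((p.lower().split() for p in disease_phrases),
                     key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for ph in phrases:
            window = [t.lower() for t in tokens[i:i + len(ph)]]
            if window == ph:
                tags[i] = "B-DISEASE"
                for j in range(i + 1, i + len(ph)):
                    tags[j] = "I-DISEASE"
                i += len(ph)
                matched = True
                break
        if not matched:
            i += 1
    return tags

sentence = ("Identification of APC2 , a homologue of the "
            "adenomatous polyposis coli tumour suppressor .")
tags = rules_bio_tag(sentence.split(), ["adenomatous polyposis coli"])
```

With only `adenomatous polyposis coli` in the phrase list, this reproduces the rules output shown below, including the mistake of leaving `tumour` tagged O.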
```python
import pandas as pd

## The token-level results from the above model on 7.2K sentences look like below
token_level_rules_output_example = pd.read_csv(
    'spacy_model_ner/token_level_tags_on_one_unlabeled_sentence.csv', index_col=False)
token_level_rules_output_example = token_level_rules_output_example[
    ['New_Sentence_Id', 'Token', 'Rules_Tag_BIO']]
token_level_rules_output_example.columns = ['Random_Sentence_Id', 'Token', 'Rules_Tag_BIO']
```
Code
token_level_rules_output_example
|    | Random_Sentence_Id | Token          | Rules_Tag_BIO |
|----|--------------------|----------------|---------------|
| 0  | tr_0               | Identification | O             |
| 1  | tr_0               | of             | O             |
| 2  | tr_0               | APC2           | O             |
| 3  | tr_0               | ,              | O             |
| 4  | tr_0               | a              | O             |
| 5  | tr_0               | homologue      | O             |
| 6  | tr_0               | of             | O             |
| 7  | tr_0               | the            | O             |
| 8  | tr_0               | adenomatous    | B-DISEASE     |
| 9  | tr_0               | polyposis      | I-DISEASE     |
| 10 | tr_0               | coli           | I-DISEASE     |
| 11 | tr_0               | tumour         | O             |
| 12 | tr_0               | suppressor     | O             |
| 13 | tr_0               | .              | O             |
| 14 | tr_0               | [SEP]          | [SEP]         |
Of course, there are mistakes in this rules NER output, like row #11 where tumour is not tagged I-DISEASE.
We rectify the mistakes of the rules by human annotation.
3. Human Review of the Rules Output
3A. Edit token-level results in csv by human annotation/review
Code
```python
## The token-level results after human review look like below
token_level_rules_plus_human_output_example = pd.read_csv(
    'spacy_model_ner/token_level_tags_on_one_unlabeled_sentence_2.csv', index_col=False)
token_level_rules_plus_human_output_example
```
|    | New_Sentence_Id | Token          | Rules_Tag_BIO | Human_Annotated_Tag_BIO |
|----|-----------------|----------------|---------------|-------------------------|
| 0  | tr_0            | Identification | O             | O                       |
| 1  | tr_0            | of             | O             | O                       |
| 2  | tr_0            | APC2           | O             | O                       |
| 3  | tr_0            | ,              | O             | O                       |
| 4  | tr_0            | a              | O             | O                       |
| 5  | tr_0            | homologue      | O             | O                       |
| 6  | tr_0            | of             | O             | O                       |
| 7  | tr_0            | the            | O             | O                       |
| 8  | tr_0            | adenomatous    | B-DISEASE     | B-DISEASE               |
| 9  | tr_0            | polyposis      | I-DISEASE     | I-DISEASE               |
| 10 | tr_0            | coli           | I-DISEASE     | I-DISEASE               |
| 11 | tr_0            | tumour         | O             | I-DISEASE               |
| 12 | tr_0            | suppressor     | O             | O                       |
| 13 | tr_0            | .              | O             | O                       |
| 14 | tr_0            | [SEP]          | [SEP]         | [SEP]                   |
4. Stratify Split the Data based on human annotations
Code
```python
sentence_level_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level.csv', index_col=False)
print(f"Total number of Sentences: {sum(sentence_level_count_df['Count'])}")
sentence_level_count_df
```
Total number of Sentences: 7295
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 3337  | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 2698  | 37.0           |
| 2 | B-DISEASE\|O            | 1260  | 17.0           |
From the above table, we can infer that:
- there are more multi-token diseases than single-token diseases
- there are ~3.3K sentences with only O tokens
We have to ensure all three splits - train, dev, and test - have the same percentage of O, B-DISEASE|I-DISEASE|O, and B-DISEASE|O sentences.
After splitting into train-dev-test in an 80-10-10 ratio …
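The stratified split can be sketched in plain Python (a hypothetical helper, not the code used in this post): bucket each sentence by its tag-class signature, then cut every bucket 80-10-10 so each split preserves the class ratios.

```python
import random
from collections import defaultdict

def stratified_split(sentence_ids, sentence_classes, seed=42):
    """Split sentence ids 80-10-10 while preserving per-class proportions."""
    by_class = defaultdict(list)
    for sid, cls in zip(sentence_ids, sentence_classes):
        by_class[cls].append(sid)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for ids in by_class.values():
        rng.shuffle(ids)
        n_dev = n_test = len(ids) // 10        # 10% each to dev and test
        dev.extend(ids[:n_dev])
        test.extend(ids[n_dev:n_dev + n_test])
        train.extend(ids[n_dev + n_test:])     # remaining ~80% to train
    return train, dev, test

# Toy example mirroring the class distribution above (46% / 37% / 17%)
ids = [f"s{i}" for i in range(100)]
classes = ["O"] * 46 + ["B-DISEASE|I-DISEASE|O"] * 37 + ["B-DISEASE|O"] * 17
train, dev, test = stratified_split(ids, classes)
print(len(train), len(dev), len(test))  # 84 8 8
```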
Code
```python
train_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_train.csv', index_col=False)
print(f"Total number of Train Sentences: {sum(train_count_df['Count'])}")
train_count_df
```
Total number of Train Sentences: 5836
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 2670  | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 2158  | 37.0           |
| 2 | B-DISEASE\|O            | 1008  | 17.0           |
Code
```python
dev_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_dev.csv', index_col=False)
print(f"Total number of Dev Sentences: {sum(dev_count_df['Count'])}")
dev_count_df
```
Total number of Dev Sentences: 729
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 333   | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 270   | 37.0           |
| 2 | B-DISEASE\|O            | 126   | 17.0           |
Code
```python
test_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_test.csv', index_col=False)
print(f"Total number of Test Sentences: {sum(test_count_df['Count'])}")
test_count_df
```
Total number of Test Sentences: 730
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 334   | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 270   | 37.0           |
| 2 | B-DISEASE\|O            | 126   | 17.0           |
5. Train ML Model
5A. Convert token-level results to spaCy model-acceptable CoNLL data
```
investigate O
the O
rate O
of O
BRCA2 O
mutation O
in O
sporadic B-DISEASE
breast I-DISEASE
cancers I-DISEASE
and O
in O
a O
set O
of O
cell O
lines O
that O
represent O
twelve O
other O
tumour B-DISEASE
types O
. O
```
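The post does not show the training commands that produced the evaluation results below. Assuming spaCy v3's CLI (the paths and config file name are illustrative, not from the original), the workflow would look like:

```shell
# Convert CoNLL-formatted token/tag files into spaCy's binary format
python -m spacy convert corpus/train.conll corpus/ --converter ner
python -m spacy convert corpus/dev.conll corpus/ --converter ner
python -m spacy convert corpus/test.conll corpus/ --converter ner

# Train on train/dev (config.cfg generated earlier via `spacy init config`)
python -m spacy train config.cfg \
    --output ./model_weights/spacy_small \
    --paths.train corpus/train.spacy \
    --paths.dev corpus/dev.spacy \
    --gpu-id 0

# Evaluate the best checkpoint on the unseen test set
python -m spacy evaluate ./model_weights/spacy_small/model-best corpus/test.spacy \
    --output spacy_small_model_evaluation.json --gpu-id 0
```

The same train/evaluate pair, with a transformer-based config, produces the roberta-base results.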
ℹ Using GPU: 0
================================== Results ==================================
TOK -
NER P 89.75
NER R 82.81
NER F 86.14
SPEED 20752
=============================== NER (per type) ===============================
P R F
DISEASE 89.75 82.81 86.14
✔ Saved results to
../data/model_weights/spacy_small/model-best/spacy_small_model_evaluation.json
ℹ Using GPU: 0
================================== Results ==================================
TOK -
NER P 88.61
NER R 90.26
NER F 89.43
SPEED 12020
=============================== NER (per type) ===============================
P R F
DISEASE 88.61 90.26 89.43
✔ Saved results to
../data/model_weights/spacy_roberta_base_/model-best/spacy_roberta_base_evaluation.json
6. Evaluate the models
6A. Comparing the entity-level F1-score of (1) Rules, (2) Spacy small and (3) Spacy Roberta-base model
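The scores above are entity-level: a predicted entity counts as correct only if both its span and its label match the gold annotation exactly. A minimal sketch of the metric (a simplified stand-in for spaCy's scorer, not its actual implementation):

```python
def entity_prf(gold_entities, pred_entities):
    """Strict entity-level precision/recall/F1 over (label, start, end) tuples."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)                      # exact span + label matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The rules model's partial match on our example sentence scores zero:
gold = {("DISEASE", 8, 12)}    # adenomatous polyposis coli tumour
pred = {("DISEASE", 8, 11)}    # rules missed the trailing token
print(entity_prf(gold, pred))  # (0.0, 0.0, 0.0)
```

This strictness is why boundary mistakes, like the untagged `tumour` earlier, hurt entity-level F1 even when most tokens are right.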
In this blog article, we have shown how to effectively build an NER model from unlabeled data.

- We compared the Rules NER, spaCy Small NER, and roberta-base NER models.
- We found the roberta-base model has the highest F1 score, at ~89%.
- We can also ensemble the results of the spaCy Small NER and spaCy RoBERTa-base NER models.
- A digression from the scope of this article: there are umpteen good tools (mostly paid) that aid annotation. Sometimes, for a simple NER problem (like tagging only one entity, as in this Disease NER), even Excel is good enough for annotation.
References:
- https://spacy.io/api/cli#evaluate
- If you would like to replicate the above results, refer to the DiseaseNER notebooks in this link