In this blog post, I cover the process of creating a trained ML NER model from unlabeled data
NLP
DL
Coding
Python
Author
Senthil Kumar
Published
June 25, 2021
1. Introduction
TL;DR Summary of the Blog
How do we create an ML NER model from no labeled data?

- Prepare rules-bootstrapped training data from the unlabeled corpus
  - If it is possible/easy to annotate directly, one can do that
  - However, if rules tagging is possible, it is a much better starting point
  - In the Disease NER dataset example here, there is an opportunity to use a huge list of disease words to tag via rules first; labeling then becomes easier than labeling from scratch
- The rules-bootstrapped data is then reviewed/edited by human annotators
- Stratify-split the human-reviewed data into train-dev-test at the sentence level
- Optimize and train one or more spaCy ML NER models
- Compare and evaluate the accuracy of the models
Now, we can list the above steps with a DISEASE NER example …
We have the ncbi_disease dataset of 7295 sentences speaking of various entities, of which the disease entity is our focus.

> E.g.: "Identification of APC2 , a homologue of the adenomatous polyposis coli tumour suppressor ."
> Here, adenomatous polyposis coli tumour is a DISEASE entity
For the sake of the argument of this blog, we assume this dataset does not have labels.
In most real-world datasets, we are not likely to have labeled data.
Hence the below pipeline helps in building an ML model:

1. Unlabeled sentences speaking of various diseases
2. Tag DISEASE entities via rules, using a huge list of disease words
3. Review/edit the rules-bootstrapped NER tags (tagging NER from scratch is a lot tougher)
4. Split the data into train-dev-test
5. Train an ML model on the train and dev datasets and evaluate on the unseen test dataset
6. Evaluate & compare the Rules model (baseline) and the spaCy ML NER models (built from spacy-small and roberta-base)
2. Prepare Rules-bootstrapped Data from unlabeled data
2A. Loading the disease words from an external file
Code
```python
import json
import re

with open('./spacy_model_ner/diseases_ner.json', 'r') as f:
    diseases_json = json.load(f)

list_of_diseases = [each['disease'] for each in diseases_json
                    if not re.search('[,]|test', each['disease'], re.I)]
list_of_diseases.extend(['tumor', 'tumour'])  # adding some custom words
list_of_diseases[0:10]
```
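The tagging itself can be sketched in plain Python (a simplified stand-in for the actual rules model, whose full code is not shown here): scan each sentence for the longest matching disease phrase and emit B-/I-DISEASE tags for the matched tokens, O elsewhere.

```python
def rules_bio_tag(tokens, disease_phrases):
    """Tag tokens with BIO labels by longest-match lookup in a phrase list."""
    # Pre-tokenize the phrases; try longer phrases before shorter ones
    phrases = sorted((p.lower().split() for p in disease_phrases),
                     key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for ph in phrases:
            window = [t.lower() for t in tokens[i:i + len(ph)]]
            if window == ph:
                tags[i] = "B-DISEASE"
                for j in range(i + 1, i + len(ph)):
                    tags[j] = "I-DISEASE"
                i += len(ph)
                matched = True
                break
        if not matched:
            i += 1
    return tags

sentence = ("Identification of APC2 , a homologue of the "
            "adenomatous polyposis coli tumour suppressor .")
tags = rules_bio_tag(sentence.split(), ["adenomatous polyposis coli"])
```

With only `adenomatous polyposis coli` in the phrase list, this reproduces the rules output shown below, including the mistake of leaving `tumour` tagged O.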
```python
import pandas as pd

## The token-level results from the above model on 7.2K sentences look like below
token_level_rules_output_example = pd.read_csv(
    'spacy_model_ner/token_level_tags_on_one_unlabeled_sentence.csv', index_col=False)
token_level_rules_output_example = token_level_rules_output_example[
    ['New_Sentence_Id', 'Token', 'Rules_Tag_BIO']]
token_level_rules_output_example.columns = ['Random_Sentence_Id', 'Token', 'Rules_Tag_BIO']
```
Code
token_level_rules_output_example
|    | Random_Sentence_Id | Token          | Rules_Tag_BIO |
|----|--------------------|----------------|---------------|
| 0  | tr_0               | Identification | O             |
| 1  | tr_0               | of             | O             |
| 2  | tr_0               | APC2           | O             |
| 3  | tr_0               | ,              | O             |
| 4  | tr_0               | a              | O             |
| 5  | tr_0               | homologue      | O             |
| 6  | tr_0               | of             | O             |
| 7  | tr_0               | the            | O             |
| 8  | tr_0               | adenomatous    | B-DISEASE     |
| 9  | tr_0               | polyposis      | I-DISEASE     |
| 10 | tr_0               | coli           | I-DISEASE     |
| 11 | tr_0               | tumour         | O             |
| 12 | tr_0               | suppressor     | O             |
| 13 | tr_0               | .              | O             |
| 14 | tr_0               | [SEP]          | [SEP]         |
Of course, there are mistakes in this rules NER output, like row #11 where tumour is not tagged I-DISEASE.
We rectify the mistakes of the rules by human annotation.
3. Human Review of the Rules Output
3A. Edit token-level results in csv by human annotation/review
Code
```python
## The token-level results after human review look like below
token_level_rules_plus_human_output_example = pd.read_csv(
    'spacy_model_ner/token_level_tags_on_one_unlabeled_sentence_2.csv', index_col=False)
token_level_rules_plus_human_output_example
```
|    | New_Sentence_Id | Token          | Rules_Tag_BIO | Human_Annotated_Tag_BIO |
|----|-----------------|----------------|---------------|-------------------------|
| 0  | tr_0            | Identification | O             | O                       |
| 1  | tr_0            | of             | O             | O                       |
| 2  | tr_0            | APC2           | O             | O                       |
| 3  | tr_0            | ,              | O             | O                       |
| 4  | tr_0            | a              | O             | O                       |
| 5  | tr_0            | homologue      | O             | O                       |
| 6  | tr_0            | of             | O             | O                       |
| 7  | tr_0            | the            | O             | O                       |
| 8  | tr_0            | adenomatous    | B-DISEASE     | B-DISEASE               |
| 9  | tr_0            | polyposis      | I-DISEASE     | I-DISEASE               |
| 10 | tr_0            | coli           | I-DISEASE     | I-DISEASE               |
| 11 | tr_0            | tumour         | O             | I-DISEASE               |
| 12 | tr_0            | suppressor     | O             | O                       |
| 13 | tr_0            | .              | O             | O                       |
| 14 | tr_0            | [SEP]          | [SEP]         | [SEP]                   |
4. Stratify Split the Data based on human annotations
Code
```python
sentence_level_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level.csv', index_col=False)
print(f"Total number of Sentences: {sum(sentence_level_count_df['Count'])}")
sentence_level_count_df
```
Total number of Sentences: 7295
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 3337  | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 2698  | 37.0           |
| 2 | B-DISEASE\|O            | 1260  | 17.0           |
From the above table, we can infer that:
- there are more multi-token diseases than single-token diseases
- there are ~3.3K sentences with only O tokens
We have to ensure all three splits - train, dev, and test - have the same percentage of O, B-DISEASE|I-DISEASE|O, and B-DISEASE|O sentences.
After splitting into train-dev-test in an 80-10-10 ratio …
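The stratified split can be sketched in plain Python (a hypothetical helper, not the code used in this post): bucket each sentence by its tag-class signature, then cut every bucket 80-10-10 so each split preserves the class ratios.

```python
import random
from collections import defaultdict

def stratified_split(sentence_ids, sentence_classes, seed=42):
    """Split sentence ids 80-10-10 while preserving per-class proportions."""
    by_class = defaultdict(list)
    for sid, cls in zip(sentence_ids, sentence_classes):
        by_class[cls].append(sid)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for ids in by_class.values():
        rng.shuffle(ids)
        n_dev = n_test = len(ids) // 10        # 10% each to dev and test
        dev.extend(ids[:n_dev])
        test.extend(ids[n_dev:n_dev + n_test])
        train.extend(ids[n_dev + n_test:])     # remaining ~80% to train
    return train, dev, test

# Toy example mirroring the class distribution above (46% / 37% / 17%)
ids = [f"s{i}" for i in range(100)]
classes = ["O"] * 46 + ["B-DISEASE|I-DISEASE|O"] * 37 + ["B-DISEASE|O"] * 17
train, dev, test = stratified_split(ids, classes)
print(len(train), len(dev), len(test))  # 84 8 8
```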
Code
```python
train_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_train.csv', index_col=False)
print(f"Total number of Train Sentences: {sum(train_count_df['Count'])}")
train_count_df
```
Total number of Train Sentences: 5836
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 2670  | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 2158  | 37.0           |
| 2 | B-DISEASE\|O            | 1008  | 17.0           |
Code
```python
dev_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_dev.csv', index_col=False)
print(f"Total number of Dev Sentences: {sum(dev_count_df['Count'])}")
dev_count_df
```
Total number of Dev Sentences: 729
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 333   | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 270   | 37.0           |
| 2 | B-DISEASE\|O            | 126   | 17.0           |
Code
```python
test_count_df = pd.read_csv(
    'spacy_model_ner/split_of_classes_sentence_level_test.csv', index_col=False)
print(f"Total number of Test Sentences: {sum(test_count_df['Count'])}")
test_count_df
```
Total number of Test Sentences: 730
|   | Sentence_level_tags     | Count | %_contribution |
|---|-------------------------|-------|----------------|
| 0 | O                       | 334   | 46.0           |
| 1 | B-DISEASE\|I-DISEASE\|O | 270   | 37.0           |
| 2 | B-DISEASE\|O            | 126   | 17.0           |
5. Train ML Model
5A. Convert token-level results to spaCy model-acceptable CoNLL data
```
investigate O
the O
rate O
of O
BRCA2 O
mutation O
in O
sporadic B-DISEASE
breast I-DISEASE
cancers I-DISEASE
and O
in O
a O
set O
of O
cell O
lines O
that O
represent O
twelve O
other O
tumour B-DISEASE
types O
. O
```
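The post does not show the training commands that produced the evaluation results below. Assuming spaCy v3's CLI (the paths and config file name are illustrative, not from the original), the workflow would look like:

```shell
# Convert CoNLL-formatted token/tag files into spaCy's binary format
python -m spacy convert corpus/train.conll corpus/ --converter ner
python -m spacy convert corpus/dev.conll corpus/ --converter ner
python -m spacy convert corpus/test.conll corpus/ --converter ner

# Train on train/dev (config.cfg generated earlier via `spacy init config`)
python -m spacy train config.cfg \
    --output ./model_weights/spacy_small \
    --paths.train corpus/train.spacy \
    --paths.dev corpus/dev.spacy \
    --gpu-id 0

# Evaluate the best checkpoint on the unseen test set
python -m spacy evaluate ./model_weights/spacy_small/model-best corpus/test.spacy \
    --output spacy_small_model_evaluation.json --gpu-id 0
```

The same train/evaluate pair, with a transformer-based config, produces the roberta-base results.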
ℹ Using GPU: 0
================================== Results ==================================
TOK -
NER P 89.75
NER R 82.81
NER F 86.14
SPEED 20752
=============================== NER (per type) ===============================
P R F
DISEASE 89.75 82.81 86.14
✔ Saved results to
../data/model_weights/spacy_small/model-best/spacy_small_model_evaluation.json
ℹ Using GPU: 0
================================== Results ==================================
TOK -
NER P 88.61
NER R 90.26
NER F 89.43
SPEED 12020
=============================== NER (per type) ===============================
P R F
DISEASE 88.61 90.26 89.43
✔ Saved results to
../data/model_weights/spacy_roberta_base_/model-best/spacy_roberta_base_evaluation.json
6. Evaluate the models
6A. Comparing the entity-level F1-score of (1) Rules, (2) Spacy small and (3) Spacy Roberta-base model
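The scores above are entity-level: a predicted entity counts as correct only if both its span and its label match the gold annotation exactly. A minimal sketch of the metric (a simplified stand-in for spaCy's scorer, not its actual implementation):

```python
def entity_prf(gold_entities, pred_entities):
    """Strict entity-level precision/recall/F1 over (label, start, end) tuples."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)                      # exact span + label matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The rules model's partial match on our example sentence scores zero:
gold = {("DISEASE", 8, 12)}    # adenomatous polyposis coli tumour
pred = {("DISEASE", 8, 11)}    # rules missed the trailing token
print(entity_prf(gold, pred))  # (0.0, 0.0, 0.0)
```

This strictness is why boundary mistakes, like the untagged `tumour` earlier, hurt entity-level F1 even when most tokens are right.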
In this blog article, we have shown how to effectively build an NER model from unlabeled data.

- We compared the Rules NER, spaCy Small NER, and roberta-base NER models.
- We found the roberta-base model has the highest F1 score, at ~89%.
- We can also ensemble the results of the spaCy Small NER and spaCy RoBERTa-base NER models.
- A digression from the scope of this article: there are umpteen good tools (mostly paid) that aid annotation. Sometimes, for a simple NER problem (like tagging only one entity, as in this Disease NER), even Excel is good enough for annotation.
References:
- https://spacy.io/api/cli#evaluate
- If you would like to replicate the above results, refer to the DiseaseNER notebooks in this link