This blog post explains the use of PyTorch for building a BoW-based text classifier.
NLP
Coding
Author
Senthil Kumar
Published
August 28, 2021
1. Introduction
Why has NLP grown in recent years?
- Because of the improvement in the ability of Language Models (such as BERT or GPT-3) to accurately understand human language
- These LMs are easy to train, as they learn from performing unsupervised pretraining tasks
What are the common types of NLP applications for which NNs are built?
- Text Classification | E.g.: email spam classification, intent classification of incoming messages in chatbots
- Sentiment Analysis | A regression task (outputs a number from most negative -1 to most positive +1 | Note: the training data needs to have outputs in that range too)
- NER (Named Entity Recognition) | A component of Information Retrieval | Every token (typically tokens that are proper nouns) is classified into a pre-defined entity type, which is then used for some downstream task
- NER and Intent Classification are often used together
  - E.g.: “Ok Google, Search apartments in Thoraipakkam”
  - Intent: Search | Entity_1 (search_entity): apartments | Entity_2 (search_filter_location): Thoraipakkam
- Text Summarization
- Question-Answering Systems | Typically closed-domain systems where the answer to a question is present in the given context
  - Context: “Joe Biden became US President in 2021, succeeding Donald Trump”
  - Query: “Who was the President of the US before Joe Biden?”
In this blog piece, let us cover the text classification task using a BoW-based vectorizer + an nn.Linear layer.
2. Representing Text as Tensors - A Quick Introduction
How do computers represent text? - Using encodings such as ASCII values to represent each character
Still, computers cannot interpret the meaning of the words; ASCII encoding merely represents text as numbers, as the snippet below shows.
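For example, in plain Python:

# each character of a word is stored as a number (its ASCII / Unicode code point)
word = "text"
print([ord(ch) for ch in word])   # [116, 101, 120, 116]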
How is text converted into embeddings?
Three levels of representation are used to convert text into numbers:
Character-level representation
Word-level representation
Token or sub-word level representation
While character-level and word-level representations are self-explanatory, token (sub-word) level representation is a combination of the above two approaches.
Some important terms:
Tokenization (sentence/text –> tokens): In the case of sub-word level representations, for example, unfriendly will be tokenized as un, #friend, #ly, where # indicates that the token is a continuation of the previous token.
This way of tokenizing helps the model learn representations for friend and unfriendly that are closer to each other in the vector space.
Numericalization (tokens –> numericals): This is the step where we convert tokens into integers.
Vectorization (numericals –> vectors): This is the process of creating vectors (typically sparse, with length equal to the size of the vocabulary of the corpus analyzed).
Embedding (numericals –> embeddings): For text data, an embedding is a lower-dimensional equivalent of a higher-dimensional sparse vector. Embeddings are typically dense; vectors are sparse.
Typical Process of Embedding Creation - text_data >> tokens >> numericals >> sparse vectors or dense embeddings
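A minimal sketch of this process (the toy vocabulary, whitespace tokenizer and embedding dimension below are made up purely for illustration):

import torch
import torch.nn as nn

text = "the cat sat on the mat"

# Tokenization: sentence -> tokens (a simple whitespace tokenizer, for illustration only)
tokens = text.split()                          # ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Numericalization: tokens -> integer ids (toy vocabulary)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
numericals = [vocab[tok] for tok in tokens]    # [0, 1, 2, 3, 0, 4]

# Vectorization: numericals -> one sparse count vector of length len(vocab)
bow_vector = torch.zeros(len(vocab))
for idx in numericals:
    bow_vector[idx] += 1                       # tensor([2., 1., 1., 1., 1.])

# Embedding: numericals -> dense, lower-dimensional vectors (one per token)
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)
dense_embeddings = embedding_layer(torch.tensor(numericals))   # shape: (6, 3)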
3. A Text Classification Pipeline to Build a BoW Classifier
Dataset considered: AG_NEWS dataset that consists of 4 classes - World, Sports, Business and Sci/Tech
┣━━ 1. Loading dataset
┃    ┣━━ torchtext.datasets.AG_NEWS
┣━━ 2. Load Tokenizer
┃    ┣━━ torchtext.data.utils.get_tokenizer('basic_english')
┣━━ 3. Build vocabulary
┃    ┣━━ torchtext.vocab.build_vocab_from_iterator(train_iterator)
┣━━ 4. Create BoW supporting functions
┃    ┣━━ Convert text_2_BoW_vector
┃    ┣━━ Create collate_fn to create a pair of label-feature tensors for every minibatch
┣━━ 5. Create train, validation and test DataLoaders
┣━━ 6. Define Model_Architecture
┣━━ 7. Define training_loop and testing_loop functions
┣━━ 8. Train the model and evaluate on test data
┣━━ 9. Test the model on sample text
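Steps 1 to 3 are not reproduced in the cells below. Based on how the later cells use train_dataset, classes, vocab_size and _text_pipeline, they would look roughly like the following sketch (an approximation modelled on the standard torchtext AG_NEWS workflow, not the notebook's exact code):

import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

device = "cuda" if torch.cuda.is_available() else "cpu"
classes = ["World", "Sports", "Business", "Sci/Tech"]

# 1. Load the AG_NEWS dataset; each item is a (label, text) pair with labels 1..4
train_dataset = list(AG_NEWS(split="train"))
test_dataset = list(AG_NEWS(split="test"))

# 2. Load a basic English tokenizer (lowercases, splits on whitespace and punctuation)
tokenizer = get_tokenizer("basic_english")

# 3. Build the vocabulary from the training corpus
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])   # out-of-vocabulary words map to <unk>
vocab_size = len(vocab)

# text pipeline: raw text -> list of token indices (numericalization)
_text_pipeline = lambda x: vocab(tokenizer(x))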
import random

for label, text in random.sample(train_dataset, 3):
    print(label, classes[label-1])
    print(text)
    print("******")
1 World
Burgers for the Health Professional Even as obesity and its consequences are increasingly taxing the health care system, fast food places are serving as hospital cafeterias.
******
4 Sci/Tech
Climate Talks Bring Bush #39;s Policy to Fore Glaciers in the Antarctic and in Greenland are melting much faster than expected, and the fastest moving glacier in the world has doubled its speed.
******
3 Business
Bush Health Savings Accounts Slow to Gain Acceptance So far employers and their workers have been slow to accept health savings accounts as an alternative to conventional health insurance.
******
3.4. Creating BoW related functions
The purpose of the text pipeline (_text_pipeline) is to convert raw text into a list of token indices (numericals).
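A quick illustration (the actual integer ids depend on the vocabulary built from the training corpus):

# raw text -> list of token indices, one integer per token
indices = _text_pipeline("wall street stocks fall")
print(indices)   # a list of 4 integer ids, one per token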
In the Bag of Words (BoW) representation,
- each word is linked to a vector index
- the vector value at that index is the frequency of occurrence of the word in the given document
Source: Microsoft Docs
3.4.1 Creating text_2_bow_vector
def to_bow(text, bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size, dtype=torch.float32)
    for i in _text_pipeline(text):
        if i < bow_vocab_size:
            res[i] += 1
    return res

print(f"sample text:\n{train_dataset[0][1]}")
print(f"\nBoW vector:\n{to_bow(train_dataset[0][1])}")
sample text:
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
BoW vector:
tensor([0., 2., 1., ..., 0., 0., 0.])
3.4.2 Create Collate Function
# the collate function
# this collate function gets a list of batch_size tuples, and needs to
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
        torch.tensor([t[0] - 1 for t in b], dtype=torch.float32),
        torch.stack([to_bow(t[1]) for t in b])
    )
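Step 5 of the pipeline (creating the DataLoaders) is also not shown above. A minimal sketch that is consistent with the numbers reported later (28,500 training batches of size 4, i.e. a 95%/5% split of the 120,000 training samples) could look like this; the split ratio and batch size are inferred, not taken from the original notebook:

from torch.utils.data import DataLoader, random_split

# assumed 95% / 5% split of the AG_NEWS training samples into train / validation
num_train = int(len(train_dataset) * 0.95)
train_split, valid_split = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

# batch_size=4 is inferred from the 28,500 training batches reported below
train_dataloader = DataLoader(train_split, batch_size=4, shuffle=True, collate_fn=bowify)
valid_dataloader = DataLoader(valid_split, batch_size=4, shuffle=True, collate_fn=bowify)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False, collate_fn=bowify)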
from torch import nn

class BOW_TextClassification(nn.Module):
    def __init__(self, vocab_size):
        # initialize the layers in the __init__ constructor
        super(BOW_TextClassification, self).__init__()   # inherit the defaults from the parent class
        self.simple_linear_stack = torch.nn.Sequential(
            torch.nn.Linear(vocab_size, 4),   # 4 denotes the number of classes
            # torch.nn.Tanh(),
            # torch.nn.Linear(512, 4),
        )

    def forward(self, features):
        # returns raw logits; nn.CrossEntropyLoss applies softmax internally
        logits = self.simple_linear_stack(features)
        return logits

bow_model = BOW_TextClassification(vocab_size).to(device)
# setting hyperparameters
lr = 0.01
optimizer = torch.optim.Adam(bow_model.parameters(), lr=lr)
loss_fn = torch.nn.CrossEntropyLoss()
epoch_size = 1   # just for checking how much time it takes
# number of training batches
len(train_dataloader)
28500
# check which device the model outputs live on (0 = the first CUDA device)
pred.get_device()
0
def train_loop(bow_model,
               train_dataloader,
               validation_dataloader,
               epoch,
               lr=lr,
               optimizer=optimizer,
               loss_fn=loss_fn,
               ):
    train_size = len(train_dataloader.dataset)
    validation_size = len(validation_dataloader.dataset)
    training_loss_per_epoch = 0
    validation_loss_per_epoch = 0

    for batch_number, (labels, features) in enumerate(train_dataloader):
        if batch_number % 100 == 0:
            print(f"In epoch {epoch}, training of {batch_number} batches are over")
        if batch_number == 100:
            break
        labels, features = labels.to(device), features.to(device)
        labels = labels.clone().detach().requires_grad_(True).long().to(device)
        # labels = torch.tensor(labels, dtype=torch.long, device=device)

        # compute prediction and prediction error
        pred = bow_model(features)
        # print(pred.dtype, pred.shape)
        loss = loss_fn(pred, labels)
        # print(loss.dtype)

        # backpropagation steps
        # key optimizer steps
        # by default, gradients add up in PyTorch
        # we zero out in every iteration
        optimizer.zero_grad()
        # performs the gradient computation steps (across the DAG)
        loss.backward()
        # adjust the weights
        optimizer.step()

        training_loss_per_epoch += loss.item()

    for batch_number, (labels, features) in enumerate(validation_dataloader):
        if batch_number == 100:
            break
        labels, features = labels.to(device), features.to(device)
        labels = labels.clone().detach().requires_grad_(True).long().to(device)
        # labels = torch.tensor(labels, dtype=torch.float32)

        # compute prediction error
        pred = bow_model(features)
        loss = loss_fn(pred, labels)
        validation_loss_per_epoch += loss.item()

    avg_training_loss = training_loss_per_epoch / train_size
    avg_validation_loss = validation_loss_per_epoch / validation_size
    print(f"Average Training Loss of {epoch}: {avg_training_loss}")
    print(f"Average Validation Loss of {epoch}: {avg_validation_loss}")
def test_loop(bow_model, test_dataloader, epoch, loss_fn=loss_fn):
    test_size = len(test_dataloader.dataset)

    # Failing to do eval can yield inconsistent inference results
    bow_model.eval()
    bow_model.to(device)
    test_loss_per_epoch, accuracy_per_epoch = 0, 0

    # disabling gradient tracking while inference
    with torch.no_grad():
        for labels, features in test_dataloader:
            labels, features = labels.to(device), features.to(device)
            labels = labels.clone().detach().requires_grad_(True).long().to(device)
            # labels = torch.tensor(labels, dtype=torch.long, device=device)
            # labels = torch.tensor(labels, dtype=torch.float32)

            pred = bow_model(features)
            loss = loss_fn(pred, labels)

            test_loss_per_epoch += loss.item()
            accuracy_per_epoch += (pred.argmax(1) == labels).type(torch.float).sum().item()

    print(f"Average Test Loss of {epoch}: {test_loss_per_epoch/test_size}")
    print(f"Average Accuracy of {epoch}: {accuracy_per_epoch/test_size}")
3.8 Training the Model
epoch_size
1
%%time
# it takes a lot of time to run this model
# hence running only for 100 batches (of size 4) in 1 epoch
for epoch in range(epoch_size):
    print(f"Epoch Number: {epoch}\n---------------------")
    train_loop(bow_model,
               train_dataloader,
               valid_dataloader,
               epoch
               )
    test_loop(bow_model, test_dataloader, epoch)
Epoch Number: 0
---------------------
In epoch 0, training of 0 batches are over
In epoch 0, training of 100 batches are over
Average Training Loss of 0: 0.0004964731066373357
Average Validation Loss of 0: 0.008571766301679114
Average Test Loss of 0: 0.12454833071194835
Average Accuracy of 0: 0.8268421052631579
CPU times: user 3h 22min 19s, sys: 13.6 s, total: 3h 22min 32s
Wall time: 6min 14s
3.9. Test the model on sample text
ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}

def predict(text, model):
    with torch.no_grad():
        bow_vector = to_bow(text)
        output = model(bow_vector)
        output_label = ag_news_label[output.argmax().item() + 1]
        return output_label

sample_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
enduring the season’s worst weather conditions on Sunday at The \
Open on his way to a closing 75 at Royal Portrush, which \
considering the wind and the rain was a respectable showing. \
Thursday’s first round at the WGC-FedEx St. Jude Invitational \
was another story. With temperatures in the mid-80s and hardly any \
wind, the Spaniard was 13 strokes better in a flawless round. \
Thanks to his best putting performance on the PGA Tour, Rahm \
finished with an 8-under 62 for a three-stroke lead, which \
was even more impressive considering he’d never played the \
front nine at TPC Southwind."

cpu_model = bow_model.to("cpu")
print(f"This is a {predict(sample_string, model=cpu_model)} news")
This is a Sports news
4. Conclusion
In this blog piece, we looked at how a BoW vectorizer was used as input to build a shallow NN classifier (a single linear layer, with no non-linear activation function).
In the next parts of this PyTorch series, I will cover better ways to build a text classification NN model from scratch.