PyTorch Fundamentals for NLP - Part 1

This blog post explains how to use PyTorch to build a BoW-based text classifier
NLP
Coding
Author

Senthil Kumar

Published

August 28, 2021

1. Introduction

Why has NLP grown in recent years?

  • Because of the improvement in the ability of Language Models (such as BERT or GPT-3) to accurately understand human language
  • These LMs are relatively easy to train since they learn from unsupervised pretraining tasks

What are the common types of NLP applications for which NNs are built?

  • Text Classification | E.g.: email spam classification, intent classification of incoming messages in chatbots
  • Sentiment Analysis | A regression task that outputs a number from most negative (-1) to most positive (+1) | Note: the training data needs to have outputs in that range too
  • NER (Named Entity Recognition) | A component of Information Retrieval | Every token (typically tokens that are proper nouns) is classified into a pre-defined entity, which is then used for some downstream task
  • NER and Intent Classification are often used together
    • E.g.: "Ok Google, Search apartments in Thoraipakkam" - Intent: Search | Entity_1 (search_entity): apartments | Entity_2 (search_filter_location): Thoraipakkam
  • Text Summarization
  • Question-Answering Systems | Typically closed-domain systems where the answer to a question lies within a given context
    • Context: "Joe Biden became US President in 2021, succeeding Donald Trump"
    • Query: "Who was the President of the US before Joe Biden?"

In this blog piece, let us cover the text classification task using a BoW-based vectorizer + an nn.Linear layer.

2. Representing Text as Tensors - A Quick Introduction

How do computers represent text?

  • Using encodings such as ASCII values to represent each character

Source: github.com/MicrosoftDocs/pytorchfundamentals

Still, computers cannot interpret the meaning of words; they just represent text as ASCII numbers, as in the image above.
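
As a tiny illustration (my own snippet, not from the Microsoft course), Python's ord() exposes these character code points directly:

Code
word = "hello"
print([ord(ch) for ch in word])   # character code points: [104, 101, 108, 108, 111]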

How is text converted into embeddings?

  • Three levels of representation to convert text into numbers:

    • Character-level representation
    • Word-level representation
    • Token or sub-word level representation
  • While character-level and word-level representations are self-explanatory, token (sub-word) level representation is a combination of the above two approaches.

Some important terms:

  • Tokenization (sentence/text –> tokens): In the case of sub-word level representations, for example, unfriendly will be tokenized as un, #friend, #ly, where # indicates that the token is a continuation of the previous token.

  • This way of tokenizing lets the learnt/trained representations for friend and unfriendly end up closer to each other in the vector space.

  • Numericalization (tokens –> numericals): This is the step where we convert tokens into integers.

  • Vectorization (numericals –> vectors): This is the process of creating vectors (typically sparse, with length equal to the vocabulary size of the corpus being analyzed).

  • Embedding (numericals –> embeddings): For text data, an embedding is a lower-dimensional equivalent of a higher-dimensional sparse vector. Embeddings are typically dense; vectors are sparse.


Typical Process of Embedding Creation
- text_data >> tokens >> numericals >> sparse vectors or dense embeddings
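
Here is a toy sketch of that pipeline using word-level tokens; the variable names (toy_vocab, etc.) are my own, and torch.nn.Embedding is used only to illustrate a dense lookup table:

Code
import torch

text = "the friend of my friend"
tokens = text.split()                                    # tokenization
toy_vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
numericals = [toy_vocab[tok] for tok in tokens]          # numericalization

# sparse bag-of-words vector: one slot per vocabulary entry, counting occurrences
bow = torch.zeros(len(toy_vocab))
for idx in numericals:
    bow[idx] += 1

# dense embedding: a trainable lookup table mapping each token id to a small vector
embedding = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=3)
dense = embedding(torch.tensor(numericals))
print(tokens, numericals, bow, dense.shape, sep="\n")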

3. A Text Classification Pipeline to Build a BoW Classifier

  • Dataset considered: AG_NEWS dataset that consists of 4 classes - World, Sports, Business and Sci/Tech

┣━━ 1. Loading dataset
┃ ┣━━ torchtext.datasets.AG_NEWS
┣━━ 2. Load tokenizer
┃ ┣━━ torchtext.data.utils.get_tokenizer('basic_english')
┣━━ 3. Build vocabulary
┃ ┣━━ torchtext.vocab.build_vocab_from_iterator(train_iterator)
┣━━ 4. Create BoW supporting functions
┃ ┣━━ Convert text_2_BoW_vector
┃ ┣━━ Create collate_fn to create a pair of label-feature tensors for every minibatch
┣━━ 5. Create train, validation and test DataLoaders
┣━━ 6. Define Model_Architecture
┣━━ 7. Define training_loop and testing_loop functions
┣━━ 8. Train the model and evaluate on test data
┣━━ 9. Test the model on sample text

Importing basic modules

Code
import torch
import torchtext
import os
import collections
import random
import numpy as np

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

3.1. Loading dataset

Code
def load_dataset(ngrams=1):
    print("Loading dataset ...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset
train_dataset, test_dataset = load_dataset()
Loading dataset ...
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

3.2. Loading Tokenizer

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
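
A quick sanity check (my own example sentence) of what the basic_english tokenizer produces - it lowercases the text and splits punctuation into separate tokens:

Code
tokenizer("Ok Google, Search apartments in Thoraipakkam")
# roughly: ['ok', 'google', ',', 'search', 'apartments', 'in', 'thoraipakkam']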

3.3. Building Vocabulary

Code
def _yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


def create_vocab(train_dataset):
    print("Building vocabulary ..")
    vocab = build_vocab_from_iterator(_yield_tokens(train_dataset),
                                      min_freq=1,
                                      specials=['<unk>']
                                     )
    vocab.set_default_index(vocab['<unk>'])
    return vocab
vocab = create_vocab(train_dataset)
Building vocabulary ..
vocab_size = len(vocab)
print("Vocab size =", vocab_size)
Vocab size = 95811
vocab(['this', 'is', 'a', 'sports', 'article','<unk>'])
[52, 21, 5, 262, 4229, 0]

Looking at some sample data

for label, text in random.sample(train_dataset, 3):
    print(label,classes[label-1])
    print(text)
    print("******")
1 World
Burgers for the Health Professional Even as obesity and its consequences are increasingly taxing the health care system, fast food places are serving as hospital cafeterias.
******
4 Sci/Tech
Climate Talks Bring Bush #39;s Policy to Fore  Glaciers in the Antarctic and in Greenland are melting much faster than expected, and the fastest moving glacier in the world has doubled its speed.
******
3 Business
Bush Health Savings Accounts Slow to Gain Acceptance So far employers and their workers have been slow to accept health savings accounts as an alternative to conventional health insurance.
******
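
3.4. Create BoW supporting functions

The DataLoaders in the next section rely on a collate function bowify, and the predict function at the end relies on to_bow. Below is a minimal sketch of these two helpers, following the MSFT BoW classifier tutorial this post is based on; the exact bodies are inferred from how they are called later.

Code
def to_bow(text, bow_vocab_size=vocab_size):
    # sparse bag-of-words vector: one slot per vocabulary entry, counting occurrences
    res = torch.zeros(bow_vocab_size, dtype=torch.float32)
    for i in vocab(tokenizer(text)):
        if i < bow_vocab_size:
            res[i] += 1
    return res


def bowify(batch):
    # collate_fn for the DataLoaders: turns a minibatch of (label, text) pairs into
    # a (labels_tensor, features_tensor) pair;
    # AG_NEWS labels are 1-4, so shift them to 0-3 for CrossEntropyLoss
    labels = torch.LongTensor([label - 1 for label, _ in batch])
    features = torch.stack([to_bow(text) for _, text in batch])
    return labels, features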

3.5. Prepare DataLoaders

BATCH_SIZE = 4
from torch.utils.data.dataset import random_split

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=bowify)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=bowify)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=bowify)
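
As a quick sanity check (an assumed snippet, not in the original post), pulling one minibatch shows the expected shapes given BATCH_SIZE = 4 and the vocabulary size above:

Code
labels, features = next(iter(train_dataloader))
print(labels.shape, features.shape)
# expected: torch.Size([4]) torch.Size([4, 95811])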

3.6. Model Architecture

from torch import nn

class BOW_TextClassification(nn.Module):
    def __init__(self, vocab_size):
        # initialize the layers in the __init__ constructor;
        # the super() call inherits the defaults from the parent nn.Module class
        super(BOW_TextClassification, self).__init__()
        self.simple_linear_stack = torch.nn.Sequential(
            torch.nn.Linear(vocab_size, 4),  # 4 denotes the number of classes
            # a deeper variant could insert a hidden layer, e.g.:
            # torch.nn.Linear(vocab_size, 512), torch.nn.Tanh(), torch.nn.Linear(512, 4)
            )

    def forward(self, features):
        # returns raw logits; CrossEntropyLoss applies the softmax internally
        logits = self.simple_linear_stack(features)
        return logits

bow_model = BOW_TextClassification(vocab_size).to(device)        
print(bow_model)
BOW_TextClassification(
  (simple_linear_stack): Sequential(
    (0): Linear(in_features=95811, out_features=4, bias=True)
  )
)
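
Note that the single Linear layer outputs raw logits; the nn.CrossEntropyLoss used later applies the softmax internally. To inspect class probabilities at inference time, you can apply torch.softmax to the logits, as in this illustrative snippet (the sample sentence is made up, and it uses the to_bow helper from section 3.4):

Code
with torch.no_grad():
    sample_vector = to_bow("Stocks rallied after the quarterly earnings report").to(device)
    logits = bow_model(sample_vector.unsqueeze(0))   # shape: [1, 4]
    probs = torch.softmax(logits, dim=1)             # rows sum to 1
    print(probs)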

3.7. Define train_loop and test_loop functions

Code
# setting hyperparameters
lr = 0.01
optimizer = torch.optim.Adam(bow_model.parameters(), lr=lr)

loss_fn = torch.nn.CrossEntropyLoss()

epoch_size = 1 # just for checking how much time it takes
# number of training batches
len(train_dataloader)
28500
def train_loop(bow_model, 
               train_dataloader,
               validation_dataloader,
               epoch,
               lr=lr,
               optimizer=optimizer,
               loss_fn=loss_fn,
              ):
    train_size = len(train_dataloader.dataset)
    validation_size = len(validation_dataloader.dataset)
    training_loss_per_epoch = 0
    validation_loss_per_epoch = 0
    for batch_number, (labels, features) in enumerate(train_dataloader):
        if batch_number %100 == 0:
            print(f"In epoch {epoch}, training of {batch_number} batches are over")
        if batch_number == 100:
            break
        labels, features = labels.to(device), features.to(device)
        # CrossEntropyLoss expects integer class indices; labels need no gradients
        labels = labels.long()
        # compute prediction and prediction error
        pred = bow_model(features)
        # print(pred.dtype, pred.shape)
        loss = loss_fn(pred, labels)
        # print(loss.dtype)
        
        # backpropagation steps
        # key optimizer steps
        # by default, gradients add up in PyTorch
        # we zero out in every iteration
        optimizer.zero_grad()
        
        # performs the gradient computation steps (across the DAG)
        loss.backward()
        
        # adjust the weights
        optimizer.step()
        training_loss_per_epoch += loss.item()
        
    for batch_number, (labels, features) in enumerate(validation_dataloader):
        if batch_number == 100:
            break
        labels, features = labels.to(device), features.to(device)
        labels = labels.long()
        # compute prediction error
        pred = bow_model(features)
        loss = loss_fn(pred, labels)
        
        validation_loss_per_epoch += loss.item()
    
    # note: only the first 100 batches contribute to the loss sums above,
    # while the division is by the full dataset size, so these averages are understated
    avg_training_loss = training_loss_per_epoch / train_size
    avg_validation_loss = validation_loss_per_epoch / validation_size
    print(f"Average Training Loss of {epoch}: {avg_training_loss}")
    print(f"Average Validation Loss of {epoch}: {avg_validation_loss}")
def test_loop(bow_model,test_dataloader, epoch, loss_fn=loss_fn):
    test_size = len(test_dataloader.dataset)
    # Failing to do eval can yield inconsistent inference results
    bow_model.eval()
    bow_model.to(device)
    test_loss_per_epoch, accuracy_per_epoch = 0, 0
    # disabling gradient tracking while inference
    with torch.no_grad():
        for labels, features in test_dataloader:
            labels, features = labels.to(device), features.to(device)
            labels = labels.long()
            pred = bow_model(features)
            loss = loss_fn(pred, labels)
            test_loss_per_epoch += loss.item()
            accuracy_per_epoch += (pred.argmax(1)==labels).type(torch.float).sum().item()
    print(f"Average Test Loss of {epoch}: {test_loss_per_epoch/test_size}")
    print(f"Average Accuracy of {epoch}: {accuracy_per_epoch/test_size}")

3.8. Training the Model

epoch_size
1
%%time
# it takes a lot of time to run this model
# hence running only for 100 batches (of size 4) in 1 epoch
for epoch in range(epoch_size):
    print(f"Epoch Number: {epoch} \n---------------------")
    train_loop(bow_model, 
               train_dataloader, 
               valid_dataloader,
               epoch
              )
    test_loop(bow_model, 
              test_dataloader,
              epoch)
Epoch Number: 0 
---------------------
In epoch 0, training of 0 batches are over
In epoch 0, training of 100 batches are over
Average Training Loss of 0: 0.0004964731066373357
Average Validation Loss of 0: 0.008571766301679114
Average Test Loss of 0: 0.12454833071194835
Average Accuracy of 0: 0.8268421052631579
CPU times: user 3h 22min 19s, sys: 13.6 s, total: 3h 22min 32s
Wall time: 6min 14s

3.9. Test the model on sample text

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}

def predict(text, model):
    with torch.no_grad():
        bow_vector = to_bow(text)
        # use the model passed in (e.g. the CPU copy below), not the global bow_model
        output = model(bow_vector)
        output_label = ag_news_label[output.argmax().item() + 1]
        return output_label
    
sample_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

cpu_model = bow_model.to("cpu")

print(f"This is a {predict(sample_string, model=cpu_model)} news")
This is a Sports news

4. Conclusion

  • In this blog piece, we looked at how a BoW vectorizer was used as the input to a shallow NN (a single nn.Linear layer with no non-linear activation) for text classification.
  • In the next parts of this PyTorch series, I will cover better ways to build a text classification NN model from scratch.

Sources

  • MSFT PyTorch NLP Course | link
  • MSFT PyTorch Course - BoW Classifier | link
  • Torchtext Tutorial on Text Classification | link