PyTorch Fundamentals for NLP - Part 2

This blog post explains how to build linear text classifiers using PyTorch modules such as nn.EmbeddingBag and nn.Embedding to convert tokenized text into embeddings.
NLP
Coding
Author

Senthil Kumar

Published

September 15, 2021

1. Introduction

In this blog piece, let us cover how to build a text classification application using an embedding layer followed by a fully connected (fc) layer.

2. Representing Text as Tensors - A Quick Introduction

How do computers represent text? - By using encodings such as ASCII to represent each character as a number

Source: github.com/MicrosoftDocs/pytorchfundamentals

Still, computers cannot interpret the meaning of the words; in the image above, text is merely represented as ASCII numbers.

How is text converted into embeddings?

  • Three levels of representation are commonly used to convert text into numbers

    • Character-level representation
    • Word-level representation
    • Token or sub-word level representation
  • While character-level and word-level representations are self-explanatory, token (sub-word) level representation is a combination of the two approaches (see the short illustration below).
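
As a quick, minimal illustration of the first two levels (plain Python, toy sentence assumed):

text = "men write code"
char_level = list(text)     # ['m', 'e', 'n', ' ', 'w', 'r', ...]  -> one unit per character
word_level = text.split()   # ['men', 'write', 'code']             -> one unit per word
print(len(char_level), len(word_level))   # 14 3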

Some important terms:

  • Tokenization (sentence/text –> tokens): In the case of sub-word level representations, for example, unfriendly will be tokenized as un, #friend, #ly, where # indicates that the token is a continuation of the previous token.

  • This way of tokenizing helps the model learn representations for friend and unfriendly that are closer to each other in the vector space.

  • Numericalization (tokens –> numericals): This is the step where we convert tokens into integers.

  • Vectorization (numericals –> vectors): This is the process of creating vectors (typically sparse, with length equal to the size of the vocabulary of the corpus being analyzed).

  • Embedding (numericals –> embeddings): For text data, an embedding is a lower-dimensional equivalent of a higher-dimensional sparse vector. Embeddings are typically dense; vectors are sparse.


Typical Process of Embedding Creation
- text_data >> tokens >> numericals >> sparse vectors or dense embeddings
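
To make these terms concrete, here is a minimal sketch of the text >> tokens >> numericals >> embeddings flow, using the same torchtext utilities that appear later in this post (toy sentence and embedding_dim=2 assumed; the exact ids depend on vocabulary ordering):

import torch
from torch import nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentence = "men write code"
tokenizer = get_tokenizer('basic_english')

tokens = tokenizer(sentence)                        # tokenization: ['men', 'write', 'code']
vocab = build_vocab_from_iterator([tokens], specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
ids = vocab(tokens)                                 # numericalization: e.g. [2, 3, 1]

embedding = nn.Embedding(len(vocab), 2)             # embedding_dim=2 chosen arbitrarily for the sketch
vectors = embedding(torch.tensor(ids))              # embedding: one dense 2-d vector per token
print(vectors.shape)                                # torch.Size([3, 2])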

3. Difference between nn.Embedding and nn.EmbeddingBag

  • nn.Embedding: A simple lookup table that stores embeddings of a fixed dictionary and size.

  • nn.EmbeddingBag: Computes sums, means or maxes of bags of embeddings, without instantiating the intermediate embeddings.

Source: PyTorch Official Documentation

nn.Embedding Explanation:
- In the above pic, we can see the sentence men write code being embedded as [(0.312, 0.385), (0.543, 0.481), (0.203, 0.404)], where embed_dim=2.
- Looking closer, men is embedded as (0.312, 0.385) and the trailing <pad> token is embedded as (0.203, 0.404).

nn.EmbeddingBag Explanation:
- Here, there is no padding token. The sentences in a batch are concatenated together and stored along with an offsets array.
- Instead of each word being represented by an embedding vector, each sentence is embedded into a single embedding vector.
- This process of "computing a single vector for an entire sentence" is also possible with nn.Embedding followed by torch.mean(dim=1), torch.sum(dim=1) or torch.max(dim=1) (see the sketch at the end of this section).

So, when to use nn.EmbeddingBag?
- nn.EmbeddingBag works well when the sequential order of words is not needed.
- Hence it suits a simple feed-forward NN, but not LSTMs or Transformers, which need the embedding of every individual token in the sequence (processed unidirectionally or bidirectionally).

Sources:
- nn.EmbeddingBag vs nn.Embedding | link
- nn.Embedding followed by torch.mean(dim=1) | link
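
To make the equivalence above concrete, here is a minimal sketch (toy vocabulary size, embedding dimension and token ids assumed) showing that nn.EmbeddingBag with mode='mean' matches nn.Embedding followed by torch.mean(dim=1) when both layers share the same weights:

import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4                       # toy sizes
ids = torch.tensor([[1, 2, 3], [4, 5, 6]])          # a batch of 2 "sentences", 3 tokens each

emb = nn.Embedding(vocab_size, embed_dim)
bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
bag.weight.data.copy_(emb.weight.data)              # make both layers use the same lookup table

one_vector_per_sentence = bag(ids)                  # shape (2, 4): one pooled vector per sentence
same_thing_via_mean = emb(ids).mean(dim=1)          # nn.Embedding followed by torch.mean(dim=1)
print(torch.allclose(one_vector_per_sentence, same_thing_via_mean))   # True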

4. A Text Classification Pipeline using nn.EmbeddingBag + nn.Linear Layer

  • Dataset considered: AG_NEWS dataset that consists of 4 classes - World, Sports, Business and Sci/Tech

┣━━ 1. Load dataset
┃ ┣━━ torchtext.datasets.AG_NEWS
┣━━ 2. Load tokenizer
┃ ┣━━ torchtext.data.utils.get_tokenizer('basic_english')
┣━━ 3. Build vocabulary
┃ ┣━━ torchtext.vocab.build_vocab_from_iterator(train_iterator)
┣━━ 4. Create the EmbeddingBag layer
┃ ┣━━ Create a collate_fn to build triplets of label-feature-offsets tensors for every minibatch
┣━━ 5. Create train, validation and test DataLoaders
┣━━ 6. Define the model architecture
┣━━ 7. Define train_loop and test_loop functions
┣━━ 8. Train the model and evaluate on test data
┣━━ 9. Test the model on sample text

Importing basic modules

Code
import torch
import torchtext
import os
import collections
import random
import numpy as np

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

4.1. Loading dataset

Code
def load_dataset(ngrams=1):
    print("Loading dataset ...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset
train_dataset, test_dataset = load_dataset()
Loading dataset ...
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

4.2. Loading Tokenizer

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

4.3. Building Vocabulary

Code
def _yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


def create_vocab(train_dataset):
    print("Building vocabulary ..")
    vocab = build_vocab_from_iterator(_yield_tokens(train_dataset),
                                      min_freq=1,
                                      specials=['<unk>']
                                     )
    vocab.set_default_index(vocab['<unk>'])
    return vocab
vocab = create_vocab(train_dataset)
Building vocabulary ..
vocab_size = len(vocab)
print("Vocab size =", vocab_size)
Vocab size = 95811
vocab(['this', 'is', 'a', 'sports', 'article','<unk>'])
[52, 21, 5, 262, 4229, 0]
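
Because we called vocab.set_default_index(vocab['<unk>']), any token missing from the vocabulary also maps to index 0 (the made-up string below is assumed to be out of vocabulary):

vocab(['qwertyzxcv'])   # expected: [0]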

Looking at some sample data

for label, text in random.sample(train_dataset, 3):
    print(label,classes[label-1])
    print(text)
    print("******")
1 World
EU-25 among least corrupt in global index Corruption is rampant in sixty countries of the world and the public sector continues to be plagued by bribery, says a report by a respected global corruption watchdog.
******
4 Sci/Tech
IDC Raises '04 PC Growth View, Trims '05 (Reuters) Reuters - Shipments of personal computers\this year will be higher than previously anticipated, boosted\by the strongest demand from businesses in five years, research\firm IDC said on Monday.
******
2 Sports
The not-so-great cover-up A crisis, they say, is the best way to test the efficiency of a system. At the Wankhede Stadium, there was a crisis on the first morning when unseasonal showers showed up on the first morning of the final Test.
******

4.4.1 Exploring arguments for nn.EmbeddingBag

Call pattern explored below: embedding_layer(input_tensor, offsets), where embedding_layer is an nn.EmbeddingBag instance.

from torch import nn

input_tensor = torch.tensor([0, 1, 2, 3, 4, 3, 2, 1], dtype=torch.int64)  # 8 token ids: two "sentences" concatenated
offsets = torch.tensor([0, 5], dtype=torch.long)                          # sentence 1 starts at index 0, sentence 2 at index 5
embedding_layer = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, sparse=True)
embedding_layer(input_tensor, offsets)                                    # one pooled vector per sentence -> shape (2, 3)
tensor([[ 0.0383,  0.0984, -0.4766],
        [-0.5284,  0.3360, -0.5838]], grad_fn=<EmbeddingBagBackward>)
4.4.2 Create Collate Function
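
The collate function below relies on two small helper pipelines: one numericalizes raw text, and the other shifts the 1-4 AG_NEWS labels to the 0-3 range expected by nn.CrossEntropyLoss. They are the same pipelines shown again in Section 5.4; defining them here lets collate_batch run as written.

# text -> list of token indices; AG_NEWS label (1..4) -> class index (0..3)
_text_pipeline = lambda x: vocab(tokenizer(x))
_label_pipeline = lambda x: int(x) - 1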

# create collate batch function 
# to club labels, tokenized_text_converted_into_numbers and token_offsets
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(_label_pipeline(_label))
        processed_text = torch.tensor(_text_pipeline(_text),
                                      dtype=torch.int64
                                     )
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list_overall = torch.tensor(label_list, dtype=torch.int64)
    text_list_overall = torch.cat(text_list)
    offsets_overall = torch.tensor(offsets[:-1]).cumsum(dim=0)
    return label_list_overall.to(device), text_list_overall.to(device), offsets_overall.to(device) 

4.5. Prepare DataLoaders

BATCH_SIZE = 4
from torch.utils.data.dataset import random_split

num_train = int(len(train_dataset) * 0.95)
num_train
114000
split_train_, split_valid_ = random_split(train_dataset, 
                                          [num_train, len(train_dataset) - num_train]
                                         )
type(split_train_)
torch.utils.data.dataset.Subset
split_train_.indices[0:5]
[1565, 113376, 44093, 96738, 56856]
train_dataloader = DataLoader(split_train_,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              collate_fn=collate_batch
                             ) 

valid_dataloader = DataLoader(split_valid_,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              collate_fn=collate_batch
                             )

test_dataloader = DataLoader(test_dataset,
                             batch_size=BATCH_SIZE,
                             shuffle=True,
                             collate_fn=collate_batch
                            )
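
A quick peek at one minibatch shows the triplet structure that collate_batch produces (a rough sketch; the exact lengths and offsets depend on the sampled texts):

labels, text, offsets = next(iter(train_dataloader))
print(labels.shape)    # torch.Size([4])  -> one label per sample
print(text.shape)      # a 1-D tensor: the token ids of all 4 samples concatenated
print(offsets)         # starting index of each sample inside `text`, e.g. tensor([0, 43, 71, 105])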

4.6. Model Architecture

from torch import nn

class LinearTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class=4):
        super(LinearTextClassifier,self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size,
                                         embed_dim,
                                         sparse=True
                                        )
        # fully connected layer
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()
        
    def init_weights(self):
        initrange = 0.5
        # initializing embedding weights as a uniform distribution
        self.embedding.weight.data.uniform_(-initrange, initrange)
        
        # initializing linear layer weights as a uniform distribution
        self.fc.weight.data.uniform_(-initrange, initrange)
        
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

Initializing model and embedding dimension

num_classes = len(set([label for (label, text) in train_dataset]))
num_classes
4
vocab_size = len(vocab)
embedding_dim = 64

# instantiating the class and pass on to device
model = LinearTextClassifier(vocab_size,
                             embedding_dim,
                             num_classes
                            ).to(device)
print(model)
LinearTextClassifier(
  (embedding): EmbeddingBag(95811, 64, mode=mean)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
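
As a quick sanity check (assuming the DataLoaders from Section 4.5 are in scope), one forward pass should yield a row of 4 class logits per sample:

labels, text, offsets = next(iter(train_dataloader))
logits = model(text, offsets)
print(logits.shape)    # torch.Size([4, 4]) -> (BATCH_SIZE, num_classes)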

4.7. Define train_loop and test_loop functions

Code
# setting hyperparameters
lr = 3
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

loss_fn = torch.nn.CrossEntropyLoss()

epoch_size = 10 # setting a low number to see time consumption
def train_loop(model, 
               train_dataloader,
               validation_dataloader,
               epoch,
               lr=lr,
               optimizer=optimizer,
               loss_fn=loss_fn,
              ):
    train_size = len(train_dataloader.dataset)
    validation_size = len(validation_dataloader.dataset)
    training_loss_per_epoch = 0
    validation_loss_per_epoch = 0
    for batch_number, (labels, features, offsets) in enumerate(train_dataloader):
        if batch_number %100 == 0:
            print(f"In epoch {epoch}, training of {batch_number} batches are over")
        # the following two commented lines were used only while checking that the function runs correctly
        # if batch_number % 10 == 0:
        #    break
        labels, features, offsets = labels.to(device), features.to(device), offsets.to(device)
        # labels = labels.clone().detach().requires_grad_(True).long().to(device)

        # compute prediction and prediction error
        pred = model(features, offsets)
        
        # print(pred.dtype, pred.shape)
        loss = loss_fn(pred, labels)
        # print(loss.dtype)
        
        # backpropagation steps
        # key optimizer steps
        # by default, gradients add up in PyTorch
        # we zero out in every iteration
        optimizer.zero_grad()
        
        # performs the gradient computation steps (across the DAG)
        loss.backward()
        
        # clip the gradients, then adjust the weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        training_loss_per_epoch += loss.item()
        
    for batch_number, (labels, features, offsets) in enumerate(validation_dataloader):
        labels, features, offsets = labels.to(device), features.to(device), offsets.to(device)
        # labels = labels.clone().detach().requires_grad_(True).long().to(device)

        # compute prediction error
        pred = model(features, offsets)
        loss = loss_fn(pred, labels)
        
        validation_loss_per_epoch += loss.item()
    
    avg_training_loss = training_loss_per_epoch / train_size
    avg_validation_loss = validation_loss_per_epoch / validation_size
    print(f"Average Training Loss of {epoch}: {avg_training_loss}")
    print(f"Average Validation Loss of {epoch}: {avg_validation_loss}")
def test_loop(model,test_dataloader, epoch, loss_fn=loss_fn):
    test_size = len(test_dataloader.dataset)
    # Failing to do eval can yield inconsistent inference results
    model.eval()
    test_loss_per_epoch, accuracy_per_epoch = 0, 0
    # disabling gradient tracking during inference
    with torch.no_grad():
        for labels, features, offsets in test_dataloader:
            labels, features, offsets = labels.to(device), features.to(device), offsets.to(device)
            # labels = labels.clone().detach().requires_grad_(True).long().to(device)

            # labels = torch.tensor(labels, dtype=torch.float32)
            pred = model(features, offsets)
            loss = loss_fn(pred, labels)
            test_loss_per_epoch += loss.item()
            accuracy_per_epoch += (pred.argmax(1)==labels).type(torch.float).sum().item()
    # following two lines are used only while testing if the fns are accurate
    # print(f"Last Prediction \n 1. {pred}, \n 2.{pred.argmax()}, \n 3.{pred.argmax(1)}, \n 4.{pred.argmax(1)==labels}")
    # print(f"Last predicted label: \n {labels}")
    print(f"Average Test Loss of {epoch}: {test_loss_per_epoch/test_size}")
    print(f"Average Accuracy of {epoch}: {accuracy_per_epoch/test_size}")

4.8 Training the Model

# checking for 1 epoch, testing for 1 epoch
epoch = 1
train_loop(model,
           train_dataloader, 
           valid_dataloader,
           epoch
          )

test_loop(model, 
          test_dataloader,
          epoch)
epoch_size
10
%%time
# it takes time to run this model
for epoch in range(epoch_size):
    print(f"Epoch Number: {epoch} \n---------------------")
    train_loop(model, 
               train_dataloader, 
               valid_dataloader,
               epoch
              )
    test_loop(model, 
              test_dataloader,
              epoch)
Epoch Number: 0 
---------------------
In epoch 0, training of 0 batches are over
In epoch 0, training of 100 batches are over
In epoch 0, training of 200 batches are over
In epoch 0, training of 300 batches are over
In epoch 0, training of 400 batches are over
In epoch 0, training of 500 batches are over
In epoch 0, training of 600 batches are over
In epoch 0, training of 700 batches are over
In epoch 0, training of 800 batches are over
In epoch 0, training of 900 batches are over
In epoch 0, training of 1000 batches are over
.....
In epoch 4, training of 28000 batches are over
In epoch 4, training of 28100 batches are over
In epoch 4, training of 28200 batches are over
In epoch 4, training of 28300 batches are over
In epoch 4, training of 28400 batches are over
Average Training Loss of 4: 0.06850743169194017
Average Validation Loss of 4: 0.10275975785597817
Average Test Loss of 4: 0.10988466973246012
Average Accuracy of 4: 0.9142105263157895
Epoch Number: 5 
---------------------
In epoch 5, training of 0 batches are over
.....
In epoch 9, training of 28200 batches are over
In epoch 9, training of 28300 batches are over
In epoch 9, training of 28400 batches are over
Average Training Loss of 9: 0.05302847929670939
Average Validation Loss of 9: 0.12290485648680452
Average Test Loss of 9: 0.1306164133289306
Average Accuracy of 9: 0.9111842105263158
CPU times: user 5min 56s, sys: 25.8 s, total: 6min 22s
Wall time: 6min 13s

4.9. Test the model on sample text

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}

def predict(text, model):
    with torch.no_grad():
        tokenized_numericalized_vector = torch.tensor(_text_pipeline(text))
        offsets = torch.tensor([0])
        output = model(tokenized_numericalized_vector, 
                       offsets)
        output_label = ag_news_label[output.argmax(1).item() + 1]
        return output_label
    
sample_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

cpu_model = model.to("cpu")

print(f"This is a {predict(sample_string, model=cpu_model)} news")
This is a Sports news

5. A Text Classification Pipeline using nn.Embedding + nn.Linear Layer

  • Dataset considered: AG_NEWS dataset that consists of 4 classes - World, Sports, Business and Sci/Tech

(same pipeline as the previous one, except for the change in step 4)

┣━━ 1. Load dataset
┃ ┣━━ torchtext.datasets.AG_NEWS
┣━━ 2. Load tokenizer
┃ ┣━━ torchtext.data.utils.get_tokenizer('basic_english')
┣━━ 3. Build vocabulary
┃ ┣━━ torchtext.vocab.build_vocab_from_iterator(train_iterator)
┣━━ 4. Create the Embedding layer
┃ ┣━━ Create a collate_fn (padify) to build pairs of label-feature tensors for every minibatch
┣━━ 5. Create train, validation and test DataLoaders
┣━━ 6. Define the model architecture
┣━━ 7. Define train_loop and test_loop functions
┣━━ 8. Train the model and evaluate on test data
┣━━ 9. Test the model on sample text

Importing basic modules

Code
import torch
import torchtext
import os
import collections
import random
import numpy as np

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

5.1. Loading dataset

Code
def load_dataset(ngrams=1):
    print("Loading dataset ...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset
train_dataset, test_dataset = load_dataset()
Loading dataset ...
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

5.2. Loading Tokenizer

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

5.3. Building Vocabulary

Code
def _yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


def create_vocab(train_dataset):
    print("Building vocabulary ..")
    vocab = build_vocab_from_iterator(_yield_tokens(train_dataset),
                                      min_freq=1,
                                      specials=['<unk>']
                                     )
    vocab.set_default_index(vocab['<unk>'])
    return vocab
vocab = create_vocab(train_dataset)
Building vocabulary ..
vocab_size = len(vocab)
print("Vocab size =", vocab_size)
Vocab size = 95811
vocab(['this', 'is', 'a', 'sports', 'article','<unk>'])
[52, 21, 5, 262, 4229, 0]

5.4. Creating nn.Embedding related pipelines

  • The text pipeline converts raw text into a list of token indices (tokenization + numericalization)
  • The label pipeline shifts the AG_NEWS labels from 1-4 to 0-3
_text_pipeline = lambda x: vocab(tokenizer(x))
_label_pipeline = lambda x: int(x) - 1
_text_pipeline("this is a sports article")
[52, 21, 5, 262, 4229]
_label_pipeline('3')
2

5.4.1 Exploring arguments for nn.Embedding

  • nn.Embedding: A simple lookup table that stores embeddings of a fixed dictionary and size.
  • Let us create an Embedding module containing 5 tensors of size 3
from torch import nn
embedding = nn.Embedding(5, 3)
for i in range(5):
    print(embedding(torch.tensor([i])))
tensor([[ 1.2225,  0.7789, -1.1441]], grad_fn=<EmbeddingBackward>)
tensor([[1.3428, 1.2356, 0.6745]], grad_fn=<EmbeddingBackward>)
tensor([[-0.6605, -1.5354, -0.4195]], grad_fn=<EmbeddingBackward>)
tensor([[-0.9991,  1.7851, -1.6268]], grad_fn=<EmbeddingBackward>)
tensor([[0.7723, 2.0980, 0.3080]], grad_fn=<EmbeddingBackward>)
an_array_input = torch.tensor([[1,2,4,3]])
embedding(an_array_input)
tensor([[[ 1.3428,  1.2356,  0.6745],
         [-0.6605, -1.5354, -0.4195],
         [ 0.7723,  2.0980,  0.3080],
         [-0.9991,  1.7851, -1.6268]]], grad_fn=<EmbeddingBackward>)

5.4.2. Create Collate Function

Dealing with Variable Sequence Size

  • Every data point in a text corpus can have a different number of tokens
  • To maintain a uniform number of input tokens per minibatch, we pad ("padify") the text
  • torch.nn.functional.pad applied to each tokenized sequence does the padding (see the quick illustration below)

Source: Microsoft PyTorch Docs
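
A quick illustration of torch.nn.functional.pad on a single tokenized sequence (toy token ids assumed):

import torch
import torch.nn.functional as F

t = torch.tensor([52, 21, 5])                        # a 3-token sequence
padded = F.pad(t, (0, 4), mode='constant', value=0)  # append 4 zeros on the right
print(padded)                                        # tensor([52, 21,  5,  0,  0,  0,  0])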

def padify(batch):
    # batch is a list of (label, text) pair of tuples
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(_label_pipeline(_label))
        tokenized_numericalized_text = torch.tensor(_text_pipeline(_text),
                                                    dtype=torch.int64 
                                                   )
        text_list.append(tokenized_numericalized_text)
    # compute max length of a sequence in this minibatch
    max_length = max(map(len,text_list))
    label_list_overall = torch.tensor(label_list, 
                                      dtype=torch.int64
                                     )
    # note: each t is already a tensor, so re-wrapping it in torch.tensor(t) triggers the
    # UserWarning shown in the outputs below; passing t directly would avoid the warning
    text_list_overall = torch.stack([torch.nn.functional.pad(torch.tensor(t),
                                                             (0, max_length - len(t)),
                                                             mode='constant',
                                                             value=0) for t in text_list
                                    ])
    return label_list_overall.to(device), text_list_overall.to(device)
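
A quick check of padify on a made-up two-item batch (hypothetical headlines; device suffixes omitted from the expected output):

toy_batch = [(3, "stocks rally on wall street after fed decision"),
             (2, "a great day for sports")]
labels, padded = padify(toy_batch)
print(labels)          # tensor([2, 1])     -> AG_NEWS labels shifted from 1..4 to 0..3
print(padded.shape)    # torch.Size([2, 8]) -> (batch, longest sequence in the batch)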

5.5. Prepare DataLoaders

BATCH_SIZE = 4
from torch.utils.data.dataset import random_split

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(train_dataset, 
                                          [num_train, len(train_dataset) - num_train]
                                         )
split_train_.indices[0:5]
[13748, 32598, 23674, 26304, 9007]
train_dataloader = DataLoader(split_train_,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              collate_fn=padify
                             ) 

valid_dataloader = DataLoader(split_valid_,
                              batch_size=BATCH_SIZE,
                              shuffle=True,
                              collate_fn=padify
                             )

test_dataloader = DataLoader(test_dataset,
                             batch_size=BATCH_SIZE,
                             shuffle=True,
                             collate_fn=padify
                            )
for i, (labels, features) in enumerate(test_dataloader):
    print(f"Tracking batch {i}")
    print(labels.shape)
    print(features.shape)
    if i == 3:
        break
    print("****")
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:18: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
Tracking batch 0
torch.Size([4])
torch.Size([4, 52])
****
Tracking batch 1
torch.Size([4])
torch.Size([4, 62])
****
Tracking batch 2
torch.Size([4])
torch.Size([4, 50])
****
Tracking batch 3
torch.Size([4])
torch.Size([4, 47])

5.6 Model Architecture

from torch import nn

class LinearTextClassifier_2(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class=4):
        super(LinearTextClassifier_2,self).__init__()
        self.embedding = nn.Embedding(vocab_size,
                                         embed_dim,
                                     )
        # fully connected layer
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()
        
    def init_weights(self):
        initrange = 0.5
        # initializing embedding weights as a uniform distribution
        self.embedding.weight.data.uniform_(-initrange, initrange)
        
        # initializing linear layer weights as a uniform distribution
        self.fc.weight.data.uniform_(-initrange, initrange)
        
        self.fc.bias.data.zero_()

    def forward(self, x):
        x = self.embedding(x)
        x = torch.mean(x, dim=1)
        return self.fc(x)

Initializing the model hyperparameters

num_classes = len(set([label for (label, text) in train_dataset]))
vocab_size = len(vocab)
embedding_dim = 64

# instantiating the class and pass on to device
model_2 = LinearTextClassifier_2(vocab_size,
                               embedding_dim,
                               num_classes
                              ).to(device)
print(model_2)
LinearTextClassifier_2(
  (embedding): Embedding(95811, 64)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
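
As a sanity check (assuming the padify-based DataLoaders from Section 5.5 are in scope), one forward pass should again yield a row of 4 class logits per sample:

labels, features = next(iter(train_dataloader))   # train_dataloader now uses the padify collate_fn
logits = model_2(features)
print(logits.shape)    # torch.Size([4, 4]) -> (BATCH_SIZE, num_classes)
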
Code
# setting hyperparameters
lr = 0.001
optimizer = torch.optim.Adam(model_2.parameters(), lr=lr)

loss_fn = torch.nn.CrossEntropyLoss()

epoch_size = 3 # setting a low number to see time consumption

5.7. Define train_loop and test_loop functions

def train_loop_2(model_2, 
               train_dataloader,
               validation_dataloader,
               epoch,
               lr=lr,
               optimizer=optimizer,
               loss_fn=loss_fn,
              ):
    train_size = len(train_dataloader.dataset)
    validation_size = len(validation_dataloader.dataset)
    training_loss_per_epoch = 0
    validation_loss_per_epoch = 0
    for batch_number, (labels, features) in enumerate(train_dataloader):
        if batch_number %1000 == 0:
            print(f"In epoch {epoch}, training of {batch_number} batches are over")
        # the following two commented lines were used only while checking that the function runs correctly
        # if batch_number % 10 == 0:
        #    break
        labels, features = labels.to(device), features.to(device)
        # labels = labels.clone().detach().requires_grad_(True).long().to(device)

        # compute prediction and prediction error
        pred = model_2(features)
        
        # print(pred.dtype, pred.shape)
        loss = loss_fn(pred, labels)
        # print(loss.dtype)
        
        # backpropagation steps
        # key optimizer steps
        # by default, gradients add up in PyTorch
        # we zero out in every iteration
        optimizer.zero_grad()
        
        # performs the gradient computation steps (across the DAG)
        loss.backward()
        
        # adjust the weights
        # torch.nn.utils.clip_grad_norm_(model_2.parameters(), 0.1)
        optimizer.step()
        training_loss_per_epoch += loss.item()
        
    for batch_number, (labels, features) in enumerate(validation_dataloader):
        labels, features = labels.to(device), features.to(device)
        # labels = labels.clone().detach().requires_grad_(True).long().to(device)

        # compute prediction error
        pred = model_2(features)
        loss = loss_fn(pred, labels)
        
        validation_loss_per_epoch += loss.item()
    
    avg_training_loss = training_loss_per_epoch / train_size
    avg_validation_loss = validation_loss_per_epoch / validation_size
    print(f"Average Training Loss of {epoch}: {avg_training_loss}")
    print(f"Average Validation Loss of {epoch}: {avg_validation_loss}")
def test_loop_2(model_2,test_dataloader, epoch, loss_fn=loss_fn):
    test_size = len(test_dataloader.dataset)
    # Failing to do eval can yield inconsistent inference results
    model_2.eval()
    test_loss_per_epoch, accuracy_per_epoch = 0, 0
    # disabling gradient tracking during inference
    with torch.no_grad():
        for labels, features in test_dataloader:
            labels, features = labels.to(device), features.to(device)
            # labels = labels.clone().detach().requires_grad_(True).long().to(device)

            # labels = torch.tensor(labels, dtype=torch.float32)
            pred = model_2(features)
            loss = loss_fn(pred, labels)
            test_loss_per_epoch += loss.item()
            accuracy_per_epoch += (pred.argmax(1)==labels).type(torch.float).sum().item()
    # following two lines are used only while testing if the fns are accurate
    # print(f"Last Prediction \n 1. {pred}, \n 2.{pred.argmax()}, \n 3.{pred.argmax(1)}, \n 4.{pred.argmax(1)==labels}")
    # print(f"Last predicted label: \n {labels}")
    print(f"Average Test Loss of {epoch}: {test_loss_per_epoch/test_size}")
    print(f"Average Accuracy of {epoch}: {accuracy_per_epoch/test_size}")

5.8 Training the model

# checking for 1 epoch, testing for 1 epoch
epoch = 1
train_loop_2(model_2,
           train_dataloader, 
           valid_dataloader,
           epoch
          )

test_loop_2(model_2, 
          test_dataloader,
          epoch)
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:18: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
In epoch 1, training of 0 batches are over
In epoch 1, training of 1000 batches are over
In epoch 1, training of 2000 batches are over
In epoch 1, training of 3000 batches are over
.....
In epoch 1, training of 26000 batches are over
In epoch 1, training of 27000 batches are over
In epoch 1, training of 28000 batches are over
Average Training Loss of 1: 0.08683817907764081
Average Validation Loss of 1: 0.0605684169218495
Average Test Loss of 1: 0.06346218769094585
Average Accuracy of 1: 0.9201315789473684
Code
# it takes time to run this model
for epoch in range(epoch_size):
    print(f"Epoch Number: {epoch} \n---------------------")
    train_loop_2(model_2, 
               train_dataloader, 
               valid_dataloader,
               epoch
              )
    test_loop_2(model_2, 
              test_dataloader,
              epoch)

5.9. Test the model on sample text

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}


def predict_2(text, model):
    batch = [(torch.tensor([0]),   # dummy label; only the text is used for prediction
              text
             )
            ]
    with torch.no_grad():
        _, padded_sequence = padify(batch)
        padded_sequence = padded_sequence.to("cpu")
        # tokenized_numericalized_vector = torch.tensor(_text_pipeline(text))
        output = model_2(padded_sequence)
        output_label = ag_news_label[output.argmax(1).item() + 1]
        return output_label
    
sample_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

cpu_model_2 = model_2.to("cpu")

print(f"This is a {predict_2(sample_string, model=cpu_model_2)} news")
This is a Sports news
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:18: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).

6. Conclusion

  • In this blog piece, we looked at how to build linear classifiers (without non-linear activation functions) on top of the nn.EmbeddingBag and nn.Embedding modules
  • In the nn.EmbeddingBag method of embedding creation, we did not create padding tokens but had to track offsets for every minibatch.
  • In the nn.Embedding method of creating embeddings, we used the torch.nn.functional.pad function to ensure all text sequences in a minibatch have the same length

Sources

  • MSFT PyTorch NLP Course | link
  • Official PyTorch Tutorial on Text Classification using nn.EmbeddingBag | link
  • MSFT PyTorch Text Classification using nn.Embedding | link