PyTorch Fundamentals for NLP - Part 1

This blog post explains how to use PyTorch to build a bag-of-words (BoW) based text classifier

Senthil Kumar


August 28, 2021

1. Introduction

Why has NLP grown in recent years?

  • Because of the improvement in the ability of Language Models (such as BERT or GPT-3) to accurately model human language
  • These LMs are easier to train because they learn from unsupervised pretraining tasks, which need no labeled data

What are the common types of NLP applications for which NNs are built?

  • Text Classification | E.g.: email spam classification, intent classification of incoming messages in chatbots
  • Sentiment Analysis | Can be framed as a regression task (outputs a number from most negative, -1, to most positive, +1 | Note: the training data needs to have outputs in the same range)
  • NER (Named Entity Recognition) | A component of Information Retrieval | We classify every token (typically tokens that are proper nouns) into a pre-defined entity type, which is then used for some downstream task

    • NER can be used together with intent classification
    • E.g.: “Ok Google, Search apartments in Thoraipakkam”
    • Intent: Search | Entity_1 (search_entity): apartments | Entity_2 (search_filter_location): Thoraipakkam
  • Text Summarization
  • Question-Answer Systems | Typically a closed-domain system wherein the answer to a question is in the context

    • Context: “Joe Biden became US President in 2021, succeeding Donald Trump”
    • Query: “Who was the President of the US before Joe Biden?”

In this blog piece, let us cover a text classification task using a BoW-based vectorizer and an nn.Linear layer

2. Representing Text as Tensors - A Quick Introduction

How do computers represent text?

  • Using encodings such as ASCII, which map each character to a number

Still, computers cannot interpret the meaning of words; they just represent text as numbers, such as the ASCII codes in the above image
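As a small illustration of the character-encoding idea, Python's built-in ord and str.encode expose the numbers behind a string. The sample word is an arbitrary choice made for this sketch.

```python
# Each character maps to an integer code; for ASCII text these are
# the same values that are stored as bytes.
text = "NLP"
codes = [ord(ch) for ch in text]
print(codes)                            # [78, 76, 80]
print(list(text.encode("ascii")))       # the same numbers, read from raw bytes
print("".join(chr(c) for c in codes))   # back to "NLP"
```

The numbers identify characters only; nothing in them says what the word means.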

How is text converted into embeddings?

  • Three types of representations to convert text into numbers

    • Character-level representation
    • Word-level representation
    • Token or sub-word level representation
  • While character-level and word-level representations are self-explanatory, token-level (sub-word) representation is a combination of the two.

Some important terms:

  • Tokenization (sentence/text –> tokens): In the case of sub-word level representations, for example, unfriendly will be tokenized as un, #friend, #ly, where # indicates that the token is a continuation of the previous token.

  • This way of tokenizing lets the model learn representations for friend and unfriendly that are closer to each other in the vector space

  • Numericalization (tokens –> numericals): This is the step where we convert tokens into integers.

  • Vectorization (numericals –> vectors): This is the process of creating vectors (typically sparse, with a length equal to the vocabulary size of the corpus analyzed)

  • Embedding (numericals –> embeddings): For text data, embedding is a lower dimensional equivalent of a higher dimensional sparse vector. Embeddings are typically dense. Vectors are sparse.
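To make the sub-word idea concrete, here is a minimal sketch of a greedy longest-match tokenizer. The function subword_tokenize and the three-piece vocabulary are hypothetical, and real sub-word tokenizers (WordPiece, BPE) learn their vocabularies from data and differ in detail.

```python
def subword_tokenize(word, pieces):
    """Greedily split `word` into the longest known pieces, left to right.

    Pieces after the first are prefixed with '#', following the notation
    used in this post. Returns ['<unk>'] when no split exists.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a known piece.
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:              # no known piece starts at this position
            return ["<unk>"]
        piece = word[start:end]
        tokens.append(piece if start == 0 else "#" + piece)
        start = end
    return tokens

toy_pieces = {"un", "friend", "ly"}   # hypothetical sub-word vocabulary
print(subword_tokenize("unfriendly", toy_pieces))   # ['un', '#friend', '#ly']
print(subword_tokenize("friendly", toy_pieces))     # ['friend', '#ly']
```

Both words contain the piece friend, which is one way a model can end up relating friend, friendly and unfriendly.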

Typical Process of Embedding Creation
- text_data >> tokens >> numericals >> sparse vectors or dense embeddings
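The process above can be sketched end to end on one sentence. This is an illustration under assumptions: the whitespace tokenizer, the six-entry toy_vocab and the embedding size of 3 are made up for the example and are not the settings used later in this post.

```python
import torch

text = "the match was a great match"
tokens = text.lower().split()                  # tokenization (whitespace only)

# Numericalization with a hypothetical toy vocabulary; index 0 is <unk>.
toy_vocab = {"<unk>": 0, "the": 1, "match": 2, "was": 3, "a": 4, "great": 5}
ids = torch.tensor([toy_vocab.get(t, 0) for t in tokens])

# Vectorization: a count vector with one slot per vocabulary entry.
bow = torch.bincount(ids, minlength=len(toy_vocab)).float()

# Embedding: each id indexes a dense, low-dimensional vector; averaging
# them gives one dense vector for the whole text.
embedding = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=3)
dense = embedding(ids).mean(dim=0)

print(ids.tolist())    # [1, 2, 3, 4, 5, 2]
print(bow.tolist())    # [0.0, 1.0, 2.0, 1.0, 1.0, 1.0]
print(dense.shape)     # torch.Size([3])
```

With a real vocabulary the count vector has tens of thousands of slots, almost all zero, while the embedding stays at a fixed small size.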

3. A Text Classification Pipeline to Build a BoW Classifier

  • Dataset considered: AG_NEWS dataset that consists of 4 classes - World, Sports, Business and Sci/Tech

┣━━ 1. Loading dataset
┃ ┣━━ torchtext.datasets.AG_NEWS(root='./data')
┣━━ 2. Loading tokenizer
┃ ┣━━ torchtext.data.utils.get_tokenizer('basic_english')
┣━━ 3. Build vocabulary
┃ ┣━━ torchtext.vocab.build_vocab_from_iterator(train_iterator)
┣━━ 4. Create BoW supporting functions
┃ ┣━━ Convert text_2_BoW_vector
┃ ┣━━ Create collate_fn to create a pair of label-feature tensors for every minibatch
┣━━ 5. Create train, validation and test DataLoaders
┣━━ 6. Define Model_Architecture
┣━━ 7. Define training_loop and testing_loop functions
┣━━ 8. Train the model and evaluate on test data
┣━━ 9. Test the model on sample text

Importing basic modules

import torch
import torchtext
import os
import collections
import random
import numpy as np

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

3.1. Loading dataset

def load_dataset(ngrams=1):
    print("Loading dataset ...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset
train_dataset, test_dataset = load_dataset()
Loading dataset ...
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

3.2. Loading Tokenizer

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

3.3. Building Vocabulary

def _yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

def create_vocab(train_dataset):
    print("Building vocabulary ..")
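    vocab = build_vocab_from_iterator(_yield_tokens(train_dataset),
                                      specials=['<unk>'])
    # assumed reconstruction: send out-of-vocabulary tokens to <unk>
    vocab.set_default_index(vocab['<unk>'])
    return vocab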
vocab = create_vocab(train_dataset)
Building vocabulary ..
vocab_size = len(vocab)
print("Vocab size =", vocab_size)
Vocab size = 95811
vocab(['this', 'is', 'a', 'sports', 'article','<unk>'])
[52, 21, 5, 262, 4229, 0]
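Conceptually, the vocabulary is a token-to-index mapping in which special tokens come first and more frequent tokens get smaller indices. The pure-Python sketch below mimics that behaviour on a toy corpus; build_toy_vocab is a hypothetical helper and a simplification, not torchtext's implementation.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (high to low), breaking ties alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

corpus = [["sports", "news", "today"], ["business", "news"]]
toy_vocab = build_toy_vocab(corpus)
unk = toy_vocab["<unk>"]

# Tokens missing from the vocabulary fall back to the <unk> index.
ids = [toy_vocab.get(tok, unk) for tok in ["news", "sports", "politics"]]
print(toy_vocab)   # {'<unk>': 0, 'news': 1, 'business': 2, 'sports': 3, 'today': 4}
print(ids)         # [1, 3, 0]
```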

Looking at some sample data

for label, text in random.sample(train_dataset, 3):
    print(label, classes[label - 1])
    print(text)
1 World
Burgers for the Health Professional Even as obesity and its consequences are increasingly taxing the health care system, fast food places are serving as hospital cafeterias.
4 Sci/Tech
Climate Talks Bring Bush #39;s Policy to Fore  Glaciers in the Antarctic and in Greenland are melting much faster than expected, and the fastest moving glacier in the world has doubled its speed.
3 Business
Bush Health Savings Accounts Slow to Gain Acceptance So far employers and their workers have been slow to accept health savings accounts as an alternative to conventional health insurance.
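The DataLoaders in the next section use a collate function named bowify, and the prediction code later calls to_bow, but neither definition appears in this post. The following is a minimal sketch of what such helpers could look like, not the original code. The toy vocabulary and whitespace tokenizer stand in for vocab and tokenizer, and shifting the labels from 1..4 to 0..3 is an assumption inferred from the "+ 1" used at prediction time.

```python
import torch

# Hypothetical stand-ins for the post's `vocab` and `tokenizer`.
toy_vocab = {"<unk>": 0, "stocks": 1, "fall": 2, "team": 3, "wins": 4}

def toy_tokenize(text):
    return text.lower().split()

def to_bow(text, vocab=toy_vocab, tokenize=toy_tokenize):
    """Return a float count vector with one slot per vocabulary entry."""
    vector = torch.zeros(len(vocab), dtype=torch.float32)
    for token in tokenize(text):
        vector[vocab.get(token, vocab["<unk>"])] += 1
    return vector

def bowify(batch):
    """Collate (label, text) pairs into a (labels, features) tensor pair."""
    # Assumption: AG_NEWS labels run 1..4; shift to 0..3 for CrossEntropyLoss.
    labels = torch.tensor([label - 1 for label, _ in batch], dtype=torch.long)
    features = torch.stack([to_bow(text) for _, text in batch])
    return labels, features

labels, features = bowify([(3, "Stocks fall"), (2, "Team wins team")])
print(labels.tolist())      # [2, 1]
print(features.tolist())    # [[0.0, 1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0, 1.0]]
```

Building the dense count vector inside the collate function means only one minibatch of full-vocabulary vectors is in memory at a time.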

3.5. Prepare DataLoaders

from torch.utils.data import random_split

BATCH_SIZE = 4  # batch size stated in section 3.8

num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=bowify)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=bowify)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=bowify)

3.6. Model Architecture

from torch import nn

class BOW_TextClassification(nn.Module):
    def __init__(self, vocab_size):
        # initialize the layers in the __init__ constructor
        # supercharge the sub-class by inheriting the defaults from parent class
        super().__init__()
        self.simple_linear_stack = torch.nn.Sequential(
            torch.nn.Linear(vocab_size, 4),  # 4 denotes the number of classes
            # torch.nn.Tanh(),
            # torch.nn.Linear(512,4), # 4 denotes the number of classes
        )

    def forward(self, features):
        # raw class scores (logits); CrossEntropyLoss applies softmax internally
        logits = self.simple_linear_stack(features)
        return logits

bow_model = BOW_TextClassification(vocab_size).to(device)
print(bow_model)
BOW_TextClassification(
  (simple_linear_stack): Sequential(
    (0): Linear(in_features=95811, out_features=4, bias=True)
  )
)
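To get a feel for the size of this model, the sketch below rebuilds just the linear layer with the sizes reported above and counts its parameters. It is a standalone illustration, separate from the bow_model object.

```python
import torch

vocab_size, num_classes = 95811, 4        # sizes reported in this post
layer = torch.nn.Linear(vocab_size, num_classes)

# One weight per (class, vocabulary entry) pair plus one bias per class.
num_params = sum(p.numel() for p in layer.parameters())
print(num_params)                          # 383248 = 95811 * 4 + 4

# A batch of two (all-zero) BoW vectors yields one row of 4 scores each.
scores = layer(torch.zeros(2, vocab_size))
print(scores.shape)                        # torch.Size([2, 4])
```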

3.7. Define train_loop and test_loop functions

# setting hyperparameters
lr = 0.01
optimizer = torch.optim.Adam(bow_model.parameters(), lr=lr)

loss_fn = torch.nn.CrossEntropyLoss()

epoch_size = 1 # just for checking how much time it takes
# number of training batches
def train_loop(bow_model, train_dataloader, validation_dataloader, epoch,
               optimizer=optimizer, loss_fn=loss_fn):
    train_size = len(train_dataloader.dataset)
    validation_size = len(validation_dataloader.dataset)
    training_loss_per_epoch = 0
    validation_loss_per_epoch = 0
    bow_model.train()
    for batch_number, (labels, features) in enumerate(train_dataloader):
        if batch_number % 100 == 0:
            print(f"In epoch {epoch}, training of {batch_number} batches are over")
        if batch_number == 100:
            break
        labels, features = labels.to(device), features.to(device)
        # CrossEntropyLoss expects integer (long) class indices
        labels = labels.long()
        # compute prediction and prediction error
        pred = bow_model(features)
        loss = loss_fn(pred, labels)
        # backpropagation steps
        # by default, gradients add up in PyTorch
        # we zero out in every iteration
        optimizer.zero_grad()
        # performs the gradient computation steps (across the DAG)
        loss.backward()
        # adjust the weights
        optimizer.step()
        training_loss_per_epoch += loss.item()
    bow_model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for batch_number, (labels, features) in enumerate(validation_dataloader):
            if batch_number == 100:
                break
            labels, features = labels.to(device), features.to(device)
            labels = labels.long()
            # compute prediction error
            pred = bow_model(features)
            loss = loss_fn(pred, labels)
            validation_loss_per_epoch += loss.item()
    # note: sums of per-batch mean losses are divided by the full dataset
    # sizes, so these values are not per-sample losses
    avg_training_loss = training_loss_per_epoch / train_size
    avg_validation_loss = validation_loss_per_epoch / validation_size
    print(f"Average Training Loss of {epoch}: {avg_training_loss}")
    print(f"Average Validation Loss of {epoch}: {avg_validation_loss}")
def test_loop(bow_model, test_dataloader, epoch, loss_fn=loss_fn):
    test_size = len(test_dataloader.dataset)
    # Failing to do eval can yield inconsistent inference results
    bow_model.eval()
    test_loss_per_epoch, accuracy_per_epoch = 0, 0
    # disabling gradient tracking while inference
    with torch.no_grad():
        for labels, features in test_dataloader:
            labels, features = labels.to(device), features.to(device)
            # CrossEntropyLoss expects integer (long) class indices
            labels = labels.long()
            pred = bow_model(features)
            loss = loss_fn(pred, labels)
            test_loss_per_epoch += loss.item()
            # count the predictions whose top-scoring class matches the label
            accuracy_per_epoch += (pred.argmax(1)==labels).type(torch.float).sum().item()
    print(f"Average Test Loss of {epoch}: {test_loss_per_epoch/test_size}")
    print(f"Average Accuracy of {epoch}: {accuracy_per_epoch/test_size}")
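Two details of these loops are easier to see on a tiny example: CrossEntropyLoss takes raw scores (logits) with integer class indices, and accuracy is counted by comparing the argmax of each row with its label. The logits and labels below are made-up values for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, 0.1],    # hypothetical model outputs
                       [0.2, 0.1, 3.0, 0.3]])
labels = torch.tensor([0, 1])                   # class indices in 0..3

loss = torch.nn.CrossEntropyLoss()(logits, labels)
# Equivalent formulation: log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))             # True

# Same bookkeeping as test_loop: count rows where the argmax equals the label.
correct = (logits.argmax(1) == labels).type(torch.float).sum().item()
print(correct / len(labels))                    # 0.5
```

Because the loss applies softmax itself, the model's forward pass does not need a softmax layer.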

3.8 Training the Model

# it takes a lot of time to run this model
# hence running only for 100 batches (of size 4) in 1 epoch
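for epoch in range(epoch_size):
    print(f"Epoch Number: {epoch} \n---------------------")
    train_loop(bow_model, train_dataloader, valid_dataloader, epoch)
    test_loop(bow_model, test_dataloader, epoch)
Epoch Number: 0 
---------------------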
In epoch 0, training of 0 batches are over
In epoch 0, training of 100 batches are over
Average Training Loss of 0: 0.0004964731066373357
Average Validation Loss of 0: 0.008571766301679114
Average Test Loss of 0: 0.12454833071194835
Average Accuracy of 0: 0.8268421052631579
CPU times: user 3h 22min 19s, sys: 13.6 s, total: 3h 22min 32s
Wall time: 6min 14s
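The three reported losses differ by orders of magnitude mainly because each one is a sum of per-batch mean losses divided by the full dataset size, while the loops cover different numbers of batches. The arithmetic below rescales them to a per-batch mean. The dataset sizes are assumptions based on the standard AG_NEWS split (120,000 training and 7,600 test articles) and the 95/5 split used here; the batch counts follow from the loops stopping at 100 batches for training and validation and running over the whole test set.

```python
# Averages reported in the run above.
reported = {"train": 0.0004964731066373357,
            "valid": 0.008571766301679114,
            "test": 0.12454833071194835}

# Assumed sizes: standard AG_NEWS split, 95/5 train/validation, batches of 4.
dataset_size = {"train": 114_000, "valid": 6_000, "test": 7_600}
batches_seen = {"train": 100, "valid": 100, "test": 7_600 // 4}

per_batch = {}
for split, value in reported.items():
    summed = value * dataset_size[split]              # undo the division
    per_batch[split] = summed / batches_seen[split]   # mean loss per batch
    print(f"{split}: {per_batch[split]:.3f}")

# The accuracy, by contrast, is a true per-sample fraction.
correct = round(0.8268421052631579 * dataset_size["test"])
print(correct)   # 6284 correctly classified test articles
```

Under these assumptions the three rescaled losses land close together, between roughly 0.5 and 0.57 per batch, so the gap in the printed averages reflects the normalization more than a difference in model quality.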

3.9.Test the model on sample text

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tech"}

def predict(text, model):
    with torch.no_grad():
        bow_vector = to_bow(text)
        output = model(bow_vector)  # use the model passed in, not the global
        output_label = ag_news_label[output.argmax().item() + 1]
        return output_label
sample_string = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

cpu_model = bow_model.to("cpu")

print(f"This is a {predict(sample_string, model=cpu_model)} news")
This is a Sports news

4. Conclusion

  • In this blog piece, we looked at how BoW vectors can be used as input to a shallow NN classifier (a single linear layer without a non-linear activation function).
  • In the next parts of this PyTorch series, I will cover better ways to build a text classification NN model from scratch


