Introduction to Statistical Language Modeling

A gentle introduction to the ABCs of NLP in the era of Transformer LMs generating poems.
Statistics
Author

Senthil Kumar

Published

March 17, 2020

1. What is a Language Model?

  • A model that generates a probability distribution over sequences of words
  • In simpler words, an LM is a model that predicts the next word in a sequence of words

2. Applications of LMs

  • Predicting upcoming words or estimating probability of a phrase or sentence is useful in noisy, ambiguous text data sources
    • Speech Recognition: E.g.: P(‘recognize speech’) >> P(‘wreck a nice beach’)
    • Spelling Correction: E.g.: P(‘I have a gub’) << P(‘I have a gun’)
    • Machine Translation: E.g.: P(‘strong winds’) > P(‘large winds’)
    • Optical Character Recognition/ Handwriting Recognition
    • Autoreply Suggestions
    • Text Classification (discussed with python implementation of a simple N-gram model)
    • Text Generation (discussed this with Char-level and Word-level language models)

3. N-gram Language Modelling

Sample Corpus:
> This is the house that Jack built.
> This is the malt
> That lay in the house that Jack built.
> This is the rat,
> That ate the malt
> That lay in the house that Jack built.
> This is the cat,
> That killed the rat,
> That ate the malt
> That lay in the house that Jack built.

Source: “The House That Jack Built” Nursery Rhyme

What is the probability of ‘house’ occurring, given that ‘the’ occurs before it?

Intuitively, we count,

\[ Prob \ ( house \ | \ the ) = { Count \ ( the \ house ) \over Count \ (the) } \]

Mathematically, the above formula can be written as \[ P({ B \ | \ A }) = { P( A, \ B ) \over \ P(A) } \]

What we have computed above is a Bigram Language Model:

\[ Bigram\:Model : { p\,( W_t|W_{t-1} ) } \]
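To make this concrete, here is a minimal counting sketch of the bigram estimate above on the sample corpus. The tokenization is a naive lowercase split and the variable names are illustrative, not taken from any notebook.

from collections import Counter

# the sample corpus ("The House That Jack Built"), naively tokenized
corpus = """This is the house that Jack built.
This is the malt
That lay in the house that Jack built.
This is the rat,
That ate the malt
That lay in the house that Jack built.
This is the cat,
That killed the rat,
That ate the malt
That lay in the house that Jack built."""

tokens = [word.strip(".,").lower() for word in corpus.split()]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

# P(house | the) = Count(the house) / Count(the)
print(bigram_counts[("the", "house")] / unigram_counts["the"])

For this corpus, Count(the house) = 4 and Count(the) = 10, so P(house | the) = 0.4.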

How do n-grams help us calculate the probability of a sentence?

> Consider a two-word sentence, S = ‘A B’
> The probability of this sentence is the joint probability of A and B:
> P(S) = Prob (A before B) = P(A, B) = P(B | A) * P(A)
>
> Now take a three-word sentence, S = ‘A B C’
> P(S) = Prob (A before B before C) = P(A, B, C)
> = P(C | A, B) * P(A, B)
> = P(C | A, B) * P(B | A) * P(A)

Even if there are more words, we could keep applying conditional probability and compute the probability of the sentence. The above rule is called the Chain Rule of Probability \[ Prob \ ({ A_n,\ A_{n-1},\ ....,A_1}) = Prob\ ( {A_n \ |\ {A_{n-1},\ ....,A_1}} ) \ \times \ Prob\ (A_{n-1},\ ....,A_1) \]

With four variables, the chain rule of probability is: \[ Prob \ ({ A_4,\ A_3,\ A_2,\ A_1}) = Prob\ ( {A_4 \ |\ {A_{3},\ A_2,\ A_1}} ) \ \times \ Prob\ ( {A_3 \ |\ {A_{2},\ A_1}})\ \times \ Prob\ ({A_2 \ |\ A_1})\ \times \ Prob\ (A_1) \]
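As a sketch of the chain rule in action, the snippet below estimates each conditional from raw counts of ever-longer histories. It assumes a toy one-sentence training corpus and hypothetical helper names.

tokens = "this is the house that jack built".split()   # toy one-sentence training corpus

def history_count(seq, tokens):
    """Count occurrences of the word sequence `seq` in the training tokens."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == seq)

def chain_rule_prob(sentence, tokens):
    """P(w1) * P(w2 | w1) * P(w3 | w1, w2) * ... with raw MLE counts (no smoothing)."""
    words = sentence.lower().split()
    prob = history_count((words[0],), tokens) / len(tokens)              # P(w1)
    for i in range(1, len(words)):
        history, full = tuple(words[:i]), tuple(words[:i + 1])
        # P(w_i | w_1 .. w_{i-1}) = Count(w_1 .. w_i) / Count(w_1 .. w_{i-1})
        prob *= history_count(full, tokens) / history_count(history, tokens)
    return prob

print(chain_rule_prob("this is the house", tokens))   # 1/7: every longer history occurs exactly once

Note that any history that never occurs in the training tokens breaks this estimate (a zero count in the numerator or denominator), which is exactly the sparsity problem the next sections address.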


4. Out of Vocabulary

Suppose we have a new sentence like the below one:

> This is the dog,
> That scared the cat,
> That killed the rat,
> That ate the malt,
> That lay in the house that Jack built.

If we calculate the probability of the above sentence based on the language model trained on the previous sample corpus (The House That Jack Built nursery rhyme), it would be zero, because “the dog” has not occurred in the LM training corpus.

By our chain rule of probabilities, where we keep multiplying probabilities, we would encounter $ P(dog\ |\ the) = 0 $, hence the overall probability of the above example sentence would be zero.


One way to avoid this: Smoothing!

\[ MLE: P({B\ | \ A}) = {Count\ (\ A,\ B\ ) \over Count\ (\ A\ )} \]

Why is it called MLE?

  • Suppose the bigram “the dog” occurs 5 times in a 100-document corpus.
  • Given this 100-document corpus (which the language model represents), the maximum likelihood estimate of the bigram parameter “the dog” appearing in the text is 0.05.

Add-1 Smoothing (Laplace Smoothing)

\[ P_{smooth}({B\ | \ A}) = {Count\ (\ A,\ B\ )\ +\ 1 \over Count\ (\ A\ )\ +\ |V|} \]

  • We pretend that each unique bigram occurs once more than it actually did!
  • Since we add 1 to every bigram count, we add |V| counts in total for a given history A. Hence we normalize by adding |V| to the denominator.
  • Disadvantage: it reduces the probability of high-frequency words.

Add-\(\delta\) Smoothing!

\[ P_{smooth}({B\ | \ A}) = {Count\ (\ A,\ B\ )\ +\ \delta \over Count\ (\ A\ )\ +\ \delta \ \times \ |V|} \]

  • \(\delta\) is a small fraction such as 0.1, 0.05, etc.
  • Like add-one smoothing, this solves the zero-probability issue, but the smaller \(\delta\) avoids shaving off as much probability from high-frequency words.
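A minimal sketch of add-\(\delta\) smoothing over bigram counts. The names are illustrative, and unseen words are mapped to a hypothetical <unk> token.

from collections import Counter

tokens = "this is the house that jack built".lower().split()
V = set(tokens) | {"<unk>"}            # vocabulary, with an unknown-word token

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def add_delta_prob(prev, word, delta=0.1):
    """P_smooth(word | prev) = (Count(prev word) + delta) / (Count(prev) + delta * |V|)."""
    prev = prev if prev in V else "<unk>"
    word = word if word in V else "<unk>"
    return (bigrams[(prev, word)] + delta) / (unigrams[prev] + delta * len(V))

print(add_delta_prob("the", "house"))  # seen bigram: relatively high
print(add_delta_prob("the", "dog"))    # unseen bigram: small but non-zero

The seen bigram “the house” keeps a high probability, while the unseen “the dog” now gets a small but non-zero one.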


5. What is Markov Assumption?

By the chain rule (that is, without any Markov assumption), the probability of occurrence of a 4-word sentence is: \[ Prob \ ({ A,\ B,\ C,\ D}) = Prob\ ({ A_{before}\ B_{before}\ C_{before}\ D}) = Prob\ ( {D \ |\ {A,\ B,\ C}} ) \ \times \ Prob\ ( {C \ |\ {A,\ B}})\ \times \ Prob\ ({B \ |\ A})\ \times \ Prob\ (A) \]

“What I see NOW depends only on what I saw in the PREVIOUS step”

\[ P(w_i\ |\ {w_{i-1},\ ....,\ w_1}) = P(w_i\ |\ {w_{i-1}}) \]

Hence the probability of occurrence of a 4-word sentence with the Markov Assumption is: \[ Prob \ ({ A,\ B,\ C,\ D}) = Prob\ ({ A_{before}\ B_{before}\ C_{before}\ D}) = Prob\ ( {D \ |\ C} ) \ \times \ Prob\ ( {C \ |\ B})\ \times \ Prob\ ({B \ |\ A})\ \times \ Prob\ (A) \]

  • What are its advantages?
  • Probability of an entire sentence could be very low but individual phrases could be more probable.
    For e.g.:
    • Actual Data: “The quick fox jumps over the lazy dog”
    • Probability of a new sentence: Prob(“The quick fox jumps over the lazy cat”) = 0 under plain MLE (though a plausible sentence, ‘cat’ is not in the training vocabulary, hence zero)
    • Under the Markov assumption (with additive smoothing), the above sentence gets a small but realistic probability, as sketched below
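A minimal sketch of that claim, assuming a bigram Markov model with add-\(\delta\) smoothing trained on just the single “quick fox” sentence; the helper names are hypothetical and unseen words are mapped to an <unk> token.

from collections import Counter

train = "the quick fox jumps over the lazy dog".lower().split()
V = set(train) | {"<unk>"}
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
delta = 0.1

def smoothed_bigram(prev, word):
    """Add-delta smoothed P(word | prev)."""
    return (bigrams[(prev, word)] + delta) / (unigrams[prev] + delta * len(V))

def markov_sentence_prob(sentence):
    """P(w1) * P(w2|w1) * ... * P(wN|wN-1): the bigram Markov approximation."""
    words = [w if w in V else "<unk>" for w in sentence.lower().split()]
    prob = (unigrams[words[0]] + delta) / (len(train) + delta * len(V))   # smoothed unigram P(w1)
    for prev, word in zip(words, words[1:]):
        prob *= smoothed_bigram(prev, word)
    return prob

print(markov_sentence_prob("The quick fox jumps over the lazy dog"))  # seen sentence: larger
print(markov_sentence_prob("The quick fox jumps over the lazy cat"))  # 'cat' unseen: small but non-zero

Both sentences now receive non-zero probabilities, with the seen “lazy dog” version scoring higher than the unseen “lazy cat” version.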

6. Evaluation

Perplexity

  • Perplexity is a measurement of how well a probability model predicts a sample.
  • It is used to compare probability models.
  • A low perplexity indicates the probability distribution is good at predicting the sample

Definition: Perplexity is the inverse probability of the test set, normalized by the number of words.

Perplexity of test data = PP(test data) = \[ P(w_1,\ w_2,\ ....,\ w_N)^{-{1 \over N}} = {\sqrt[N]{1 \over P(w_1,\ w_2,\ ....,\ w_N)}} = {\sqrt[N]{\prod\limits_{i=1}^{N} {1 \over P(w_i\ |\ {w_{i-1},\ ....,\ w_1})}}} \]

The lower the perplexity of a LM on a sample, the higher the confidence that the sample occurs under the distribution.
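A small sketch of the perplexity computation itself, using made-up per-word conditional probabilities for two hypothetical models on the same sample.

import math

def perplexity(cond_probs):
    """PP = (prod 1/p_i) ** (1/N), computed in log space for numerical stability."""
    N = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / N)

# hypothetical per-word conditional probabilities assigned by two models
good_model = [0.2, 0.5, 0.4, 0.3]
poor_model = [0.05, 0.1, 0.02, 0.01]

print(perplexity(good_model))  # lower perplexity: the model predicts the sample well
print(perplexity(poor_model))  # higher perplexity: the model is "more surprised" by the sample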


Log Probability

  • But multiplying many probabilities like these drives the result very close to zero.

  • For e.g.: P(D|C) = 0.1, P(C|B) = 0.07, P(B|A) = 0.05, P(A) = 0.2

  • P(A before B before C before D) = 0.1 × 0.07 × 0.05 × 0.2 = 0.00007 \(-->\) too low, almost zero !

  • Logarithm to the rescue !

  • Logarithm is a monotonically increasing function !

  • Meaning: If P(E|D) > P(D|C), then log(P(E|D)) > log (P(D|C))

  • Also, log (A *B) = log (A) + log (B)

[Figure: plot of the logarithm function, showing that log x is negative for x < 1]

\[ \log P(w_1,\ w_2,\ ....,\ w_N) = \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \]

  • where N is the number of words in the sentence.
  • Since log probabilities are always negative (see the figure: the log of values less than 1 is negative), every additional word makes the sum more negative, so shorter sentences tend to get higher log probabilities.

To normalize for length of sentences, \[ {1 \over N} \log P(w_1,\ w_2,\ ....,\ w_N) = {1 \over N} \left [\ \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \right] \]

The higher the log probability a LM assigns to a sample, the higher the confidence that the sample occurs under the distribution.


Perplexity and Log-Probability metrics for a Bigram Markov Language Model

\[ Perplexity\ for\ a\ bigram\ model = {\sqrt[N]{\prod\limits_{i=1}^{N} {1 \over P(w_i\ |\ {w_{i-1}})}}}\]

Comparing with normalized log probability for a bigram model:
\[ LogProbability\ for\ a\ bigram\ model = {1 \over N} \left [\ \sum\limits_{\ i\ =\ 2}^{N} {\log P(w_i\ |\ w_{i-1})}\ +\ \log P(w_1) \right] \]
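The sketch below ties the two metrics together for an add-\(\delta\) smoothed bigram model trained on a toy sentence. The names are illustrative; with natural logs, the perplexity here is simply exp of the negative normalized log probability.

import math
from collections import Counter

train = "this is the house that jack built".lower().split()
V, delta = set(train), 0.1
unigrams, bigrams = Counter(train), Counter(zip(train, train[1:]))

def p(word, prev=None):
    """Add-delta smoothed unigram (prev=None) or bigram probability."""
    if prev is None:
        return (unigrams[word] + delta) / (len(train) + delta * len(V))
    return (bigrams[(prev, word)] + delta) / (unigrams[prev] + delta * len(V))

def normalized_log_prob(words):
    """(1/N) * [ log P(w1) + sum_i log P(w_i | w_{i-1}) ]"""
    logp = math.log(p(words[0])) + sum(math.log(p(w, prev)) for prev, w in zip(words, words[1:]))
    return logp / len(words)

def bigram_perplexity(words):
    """N-th root of the product of 1/P(.), i.e. exp(-normalized log probability)."""
    return math.exp(-normalized_log_prob(words))

test = "this is the house".lower().split()
print(normalized_log_prob(test), bigram_perplexity(test))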


7. Application

Text Classification

In this notebook, a Markov bigram model is applied to a text classification problem. We can build a deterministic classifier where the input text is assigned to the class with the highest log probability score.

print(bigram_markov("economic growth slows down"))
# log probability score of different classes
[('business', -3.9692388514802905),
 ('politics', -6.7677569438805945),
 ('tech', -8.341093396600021),
 ('sport', -9.000286606679415),
 ('entertainment', -9.09610496174359)]

Class attributed: 'business'
print(bigram_markov("prime minister has gone to the US on an official trip"))
# log probability score of different classes
[('politics', -3.5746871267381852),
 ('business', -7.440089097987748),
 ('tech', -10.252850263607193),
 ('sport', -11.981405120960808),
 ('entertainment', -12.124273481509913)]
 
Class attributed: 'politics'
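A minimal sketch of how such a classifier could be structured; the actual bigram_markov implementation in the notebook may differ. The idea is to fit one smoothed bigram LM per class and rank classes by normalized log probability.

import math
from collections import Counter

class BigramMarkovClassifier:
    """One add-delta smoothed bigram LM per class; score = normalized log probability."""

    def __init__(self, delta=0.1):
        self.delta = delta
        self.unigrams, self.bigrams, self.vocab = {}, {}, set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.vocab.update(tokens)
            self.unigrams.setdefault(label, Counter()).update(tokens)
            self.bigrams.setdefault(label, Counter()).update(zip(tokens, tokens[1:]))

    def _log_prob(self, label, prev, word):
        # add-delta smoothed log P(word | prev) under this class's bigram counts
        numerator = self.bigrams[label][(prev, word)] + self.delta
        denominator = self.unigrams[label][prev] + self.delta * len(self.vocab)
        return math.log(numerator / denominator)

    def score(self, text):
        tokens = text.lower().split()
        scores = []
        for label in self.unigrams:
            logp = sum(self._log_prob(label, prev, word) for prev, word in zip(tokens, tokens[1:]))
            scores.append((label, logp / max(len(tokens), 1)))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)

# usage sketch (train_texts / train_labels are placeholders for BBC-style news data):
# clf = BigramMarkovClassifier()
# clf.fit(train_texts, train_labels)
# print(clf.score("economic growth slows down"))   # highest-scoring class comes first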

Text Generation

In this notebook, a char-level LM, using trigrams and bigrams of characters, was used to generate dinosaur names.

>> print(names.iloc[:,0].head())

# Names of dinosaurs used as input
0     Aachenosaurus
1          Aardonyx
2    Abdallahsaurus
3       Abelisaurus
4     Abrictosaurus
>> char_level_trigram_markov_model(names)

# Names of dinosaurs generated as output
broplisaurus
prorkhorpsaurus
jintagasaurus
carsaurus
gapophongasaurus
teglangsaurus
borudomachsaurus
zheisaurus
horachillisaurus
aveiancsaurus
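For illustration, here is a minimal sketch of a character-level trigram generator in this spirit; it is hypothetical, and the notebook's char_level_trigram_markov_model may be implemented differently. The next character is sampled given the previous two, with ^ and $ used as start/end markers.

import random
from collections import Counter, defaultdict

names = ["aachenosaurus", "aardonyx", "abdallahsaurus", "abelisaurus", "abrictosaurus"]

# count next-character frequencies for each 2-character context
trigram_counts = defaultdict(Counter)
for name in names:
    padded = "^^" + name.lower() + "$"
    for i in range(len(padded) - 2):
        trigram_counts[padded[i:i + 2]][padded[i + 2]] += 1

def generate(max_len=20):
    context, out = "^^", []
    while len(out) < max_len:
        choices = trigram_counts.get(context)
        if not choices:
            break
        chars, weights = zip(*choices.items())
        nxt = random.choices(chars, weights=weights)[0]  # sample proportionally to counts
        if nxt == "$":                                   # end-of-name marker
            break
        out.append(nxt)
        context = context[1] + nxt
    return "".join(out)

print([generate() for _ in range(5)])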

References: