Comparing Two Unsupervised Clustering Algorithms for Text Data

A Short Study comparing PTW-LDA and Transfer Learning powered K-Means on Text Data
Categories: ML, NLP

Author: Senthil Kumar

Published: September 17, 2020

I. Introduction

What algorithms are covered here?

  • Prior-topic words guided Latent Dirichlet Allocation and
  • Universal Sentence Encoder (USE) powered K-Means

What questions am I answering here?

  • Given that USE can encode complex semantic information, is LDA worthwhile?

  • Can LDA produce better or on-par results than USE-KMeans for at least some types of data?

Before diving into the study, let us understand how USE-KMeans and PTW-guided LDA work.


II. About USE

IIA. How does USE work?

USE converts sentences into 512-dimensional embeddings


Semantic Similarity Correlation Matrix


Despite sharing many common words, semantically different sentences will have dissimilar embeddings
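To make this concrete, here is a minimal sketch (not from the original study) of obtaining USE embeddings from TensorFlow Hub and comparing them with cosine similarity; the module URL/version and the example sentences are illustrative.

```python
# Minimal sketch: embed sentences with a pre-trained USE model from TF Hub
# and compare them with cosine similarity. Module URL/version is illustrative.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "How old are you?",    # asks about age
    "What is your age?",   # same meaning, few shared words
    "How are you?",        # shares words with the first, different meaning
]
vectors = embed(sentences).numpy()   # shape: (3, 512)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # high: same meaning, few common words
print(cosine(vectors[0], vectors[2]))  # lower: common words, different meaning
```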

IIB. USE Architecture

There are two variants

Variant 1: Transformer Encoder:

source: https://amitness.com/2020/06/universal-sentence-encoder/

What does a Transformer Encoder layer comprise?
  • a Self-Attention layer
  • a Feed-Forward Network layer

Variant 2: Deep Averaging Network:

IIC. Pre-trained Tasks in USE

Overall Pipeline of how USE is trained:


A. Skip-thought prediction:
- Use Central Sentence to predict both the Previous Sentence and Next Sentence

B. Response Prediction:
- Train the USE architecture to do smart reply prediction

C. Natural Language Inference:
- Do NLI task prediction: given a premise and a hypothesis, the model is trained to predict whether the hypothesis is an entailment of, a contradiction of, or neutral to the premise

III. About LDA

  • Latent:
    • Topic structures in a document are latent/hidden in the text
  • Dirichlet:
    • The Dirichlet distribution determines the mixture proportions of
      - the topics in each document, and
      - the words in each topic.
  • Allocation:
    • Allocation of words to a given topic and allocation of topics to a document

IIIA. Intuitive understanding of Dirichlet Distribution

  • A Normal/Gaussian distribution is a continuous probability distribution over all the real numbers
    • It is described by a mean and a variance.

  • A Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval

The Poisson distribution is specified by one parameter: lambda (λ). As lambda increases to sufficiently large values, the normal distribution with mean λ and variance λ may be used to approximate the Poisson distribution. Source: https://en.wikipedia.org/wiki/Poisson_distribution


  • Now, what is the Dirichlet distribution?
    • The Dirichlet distribution is a probability distribution as well,
    • but it does not sample from the space of real numbers; instead, it samples points on a probability simplex, e.g.:
 (0.6, 0.4)
 (0.1, 0.1, 0.8)
 (0.05, 0.2, 0.15, 0.1, 0.3, 0.2)

  • How the Dirichlet distribution varies w.r.t. the Dirichlet prior:
    • The image below shows Dir(α).
    • As α increases from 0.05 (1/20) to 0.1, 0.2 and 0.4 in the plots from left to right and top to bottom, the distribution becomes more uniform (see the numpy sketch below).
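To make the simplex intuition concrete, here is a small numpy sketch (sizes, α values and the seed are illustrative); each draw from Dir(α) sums to one, and larger α spreads the mass more evenly across components.

```python
# Sketch: draw from a symmetric Dirichlet over 20 topics for several alphas.
# Each sample is a point on the probability simplex: non-negative, sums to 1.
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.05, 0.1, 0.2, 0.4):
    sample = rng.dirichlet([alpha] * 20)
    print(f"alpha={alpha:<4} sum={sample.sum():.2f} "
          f"max={sample.max():.2f} min={sample.min():.4f}")
# Small alpha -> mass concentrates on a few components (sparse mixtures);
# larger alpha -> the sample spreads more evenly across the 20 components.
```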

IIIB. How LDA Works?

  • LDA is a generative model
  • LDA processes documents as ‘bag of words’ – ordering of words is not important

In principle, LDA generates a document based on a Dirichlet distribution of topics over documents and a Dirichlet distribution of words over topics.



For statistical inference, we invert this generative process: given the observed documents, we infer the hidden topic structure.
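As a toy illustration of the generative (forward) direction, here is a minimal numpy sketch; the sizes and priors are illustrative, not the study's settings.

```python
# Sketch of LDA's generative story for a single document (toy sizes, numpy only).
import numpy as np

rng = np.random.default_rng(0)
T, V, doc_len = 3, 8, 10                   # topics, vocab size, words per doc
alpha, beta = 1.0 / T, 1.0 / T             # symmetric Dirichlet priors

phi = rng.dirichlet([beta] * V, size=T)    # per-topic word distributions (T x V)
theta = rng.dirichlet([alpha] * T)         # per-document topic distribution

document = []
for _ in range(doc_len):
    z = rng.choice(T, p=theta)             # topic assignment for this word slot
    w = rng.choice(V, p=phi[z])            # word drawn from that topic
    document.append(w)
print(document)                            # word ids of the generated document
# Statistical inference runs this in reverse: given documents, infer theta and phi.
```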

IIIC. Hyperparameters of LDA

D = Total No. of Documents
N = No. of Words = Vocab Size
T = No. of Topics

θd = Topic Distribution for a particular document d


Φt = Word distribution for a topic t (shown here for topics 1 and 2).

(colored books represent words/tokens)

Zd,n = Topic assignment for the nth word in document d
Wd,n = nth word in dth document
α = parameter that sets the Dirichlet prior on the per-document topic distribution (θ)
= parameter that represents the doc-topic density
= determines the no. of topics in each document
= 1/num_of_topics by default (in sklearn and gensim)
Decreasing alpha results in fewer topics per document.

β = parameter that sets the Dirichlet prior on the per-topic word distribution (ϕ)
= parameter that represents the topic-word density
= determines the no. of words in each topic
= 1/num_of_topics by default (in sklearn and gensim)
Decreasing beta results in just a few dominant words per topic.

m = base measure for the per-document topic distribution; a simplex vector (m1, m2, …, mT) that sums to one
α = concentration parameter (positive scalar)

α * m = (α1, α2, …, αT)

n = base measure for the per-topic word distribution; a simplex vector (n1, n2, …, nN) that sums to one
β = concentration parameter (positive scalar)

β * n = (β1, β2, …, βN)

The number of topics is itself another hyperparameter; topic coherence or perplexity can be used to determine it. A hedged sketch of how these hyperparameters appear in scikit-learn follows below.
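For reference, here is a hedged sketch of how these hyperparameters map onto scikit-learn's LatentDirichletAllocation; the toy corpus and the chosen n_components and max_iter values are illustrative.

```python
# Sketch: the LDA hyperparameters above as exposed by scikit-learn.
# The toy corpus and the chosen values are illustrative.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["iraq war troops killed in clash",
        "australia wins world cup match"]          # toy corpus
X = CountVectorizer(stop_words="english").fit_transform(docs)

n_topics = 2
lda = LatentDirichletAllocation(
    n_components=n_topics,            # T: number of topics
    doc_topic_prior=1.0 / n_topics,   # alpha (sklearn's default is 1/n_components)
    topic_word_prior=1.0 / n_topics,  # beta  (sklearn's default is 1/n_components)
    learning_method="batch",          # whole corpus per iteration; "online" uses mini-batches
    max_iter=50,
    random_state=0,
)
doc_topic = lda.fit_transform(X)      # theta_d: per-document topic distribution
topic_word = lda.components_          # unnormalized phi_t: per-topic word weights
```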

IIID. Now, How does PTW-LDA work?

  • Nudge regular LDA to converge faster and better using a human-reviewed word list for each topic

  • How the topics are seeded with some seed words (Source: freecodecamp.org)

  • Chaotic LDA output vs. clear PTW-LDA output…

  • LDA might need several hyperparameter tuning attempts to get to the desired splits

  • Default initialization with uniform word topic distribution

  • Seeded Initialization

  • The seeded words are guided towards their seeded topics so that the model converges faster (see the sketch below)
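One concrete way to implement this seeding is the guidedlda package; the sketch below follows its README-style API, so treat the package's availability and exact signatures as an assumption, and the toy corpus, seed word lists and seed_confidence value as illustrative.

```python
# Sketch of PTW-style seeding with the guidedlda package (API per its README;
# availability and exact signatures are an assumption, values are illustrative).
import guidedlda
from sklearn.feature_extraction.text import CountVectorizer

docs = ["iraq war troops killed in clash",
        "australia wins world cup match",
        "court charges man with murder"]            # toy corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()        # doc-term count matrix
word2id = vectorizer.vocabulary_

# Human-reviewed prior topic words, one list per topic (illustrative).
seed_topic_list = [["iraq", "war", "troops"],
                   ["cup", "match", "wins"],
                   ["court", "murder", "charges"]]
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=3, n_iter=100, random_state=7)
# seed_confidence controls how strongly seed words are nudged toward their topics.
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```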

IIIE. Pre-processing and Hyper-parameters

Preprocessing:

PTW-LDA:
1. Stop words were removed
2. Words were lemmatized
3. The top 20 words per (ground truth) label were extracted
4. A human-reviewed list of 20 word lists (each corresponding to one topic) was chosen as the prior-topic-words input (the seeding is useful only when prior topic words are fed; otherwise it works as normal LDA)

USE + Clustering:
No pre-processing; documents were used as-is.

Hyperparameter Tuning (PTW-LDA settings are the best possible based on heuristics and a limited number of experiments):

PTW-LDA:
  • No. of topics
  • Max iterations
  • doc_topic_prior = alpha = 1/no_of_topics (default)
  • topic_word_prior = beta = 1/no_of_topics (default)
  • learning_method: "batch" (the whole dataset is used per iteration; the alternative is "online", which uses only batch_size documents at a time, similar to MiniBatchKMeans)
  • Seeded words list
  • Seed coefficient / seed confidence (how much to nudge the seeded words)

USE + Clustering (see the pipeline sketch below):
  • No. of clusters
  • Max iterations (for K-Means)
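For the USE + Clustering arm, the pipeline is roughly the sketch below; the module URL, number of clusters and toy documents are illustrative.

```python
# Sketch: USE + K-Means arm of the comparison. No text pre-processing;
# raw documents are embedded and then clustered. Values are illustrative.
import tensorflow_hub as hub
from sklearn.cluster import KMeans

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

docs = ["iraq war troops killed in clash",
        "australia wins world cup match",
        "court charges man with murder"]
vectors = embed(docs).numpy()                       # shape: (n_docs, 512)

kmeans = KMeans(n_clusters=3, max_iter=300, random_state=0)
labels = kmeans.fit_predict(vectors)                # cluster id per document
print(labels)
```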

IV. Comparison Study

  • Data I - 20 Newsgroups - Supervised Evaluation

Source for Adjusted Mutual Information Score

  • Data II - ABC (Australian Broadcasting Corporation) Corpus - Unsupervised Evaluation
Metric: Word-embedding-based Coherence Score (the higher the coherence score, the better the clustering output)

  • PTW-LDA: coherence score for 20 topics: 0.091
  • USE-Clustering: coherence score for 20 clusters: 0.159

Methodology of computing the above metric:

PTW-LDA: Take the top 10 words that constitute each of the 20 topics (each topic is a probability simplex over the words; select the 10 most probable words in that topic).

For our case, the top 10 words used for the coherence computation in the 10 topics are:

[[‘police’, ‘baghdad’, ‘war’, ‘probe’, ‘protest’, ‘anti’, ‘missing’, ‘man’, ‘fight’, ‘coalition’],
[‘report’, ‘pm’, ‘death’, ‘korea’, ‘claim’, ‘war’, ‘north’, ‘nt’, ‘toll’, ‘protesters’],
[‘win’, ‘govt’, ‘set’, ‘community’, ‘end’, ‘wins’, ‘vic’, ‘indigenous’, ‘road’, ‘help’],
[‘world’, ‘cup’, ‘australia’, ‘found’, ‘ban’, ‘plans’, ‘lead’, ‘gets’, ‘expected’, ‘match’],
[‘un’, ‘coast’, ‘title’, ‘takes’, ‘peace’, ‘iraq’, ‘gold’, ‘defence’, ‘residents’, ‘play’],
[‘iraq’, ‘iraqi’, ‘war’, ‘says’, ‘troops’, ‘killed’, ‘dead’, ‘hospital’, ‘clash’, ‘forces’],
[‘council’, ‘boost’, ‘mp’, ‘fire’, ‘group’, ‘qld’, ‘minister’, ‘defends’, ‘land’, ‘welcomes’],
[‘man’, ‘court’, ‘charged’, ‘face’, ‘plan’, ‘open’, ‘murder’, ‘urged’, ‘case’, ‘charges’],
[‘new’, ‘oil’, ‘dies’, ‘security’, ‘crash’, ‘sars’, ‘high’, ‘year’, ‘house’, ‘car’],
[‘water’, ‘rain’, ‘claims’, ‘wa’, ‘nsw’, ‘farmers’, ‘drought’, ‘howard’, ‘centre’, ‘union’]]

For each topic, taking 2 of the top 10 words at a time, compute the cosine similarity using pre-trained W2V embeddings.
The overall coherence is the sum of the similarity scores of all possible pairs of words for each topic, normalized by the no. of topics.

USE-Clustering: Take the top 10 words that constitute each of the 20 clusters (the top 10 words, from the stop-words-removed set, are selected based on their TF-IDF-weighted scores in that cluster).

For each cluster, taking 2 of the top 10 words at a time, compute the cosine similarity using pre-trained W2V embeddings.

For our case, the top 10 words used for the coherence computation in the 10 clusters are:

[[‘win’, ‘cup’, ‘final’, ‘wins’, ‘world’, ‘afl’, ‘coach’, ‘england’, ‘season’, ‘day’],
[‘council’, ‘plan’, ‘market’, ‘funding’, ‘boost’, ‘housing’, ‘water’, ‘funds’, ‘budget’, ‘rise’],
[‘crash’, ‘man’, ‘killed’, ‘death’, ‘dies’, ‘dead’, ‘injured’, ‘woman’, ‘car’, ‘sydney’],
[‘interview’, ‘michael’, ‘business’, ‘abc’, ‘news’, ‘market’, ‘analysis’, ‘david’, ‘extended’, ‘andrew’],
[‘australia’, ‘australian’, ‘aussie’, ‘sydney’, ‘australians’, ‘day’, ‘aussies’, ‘australias’, ‘melbourne’, ‘south’],
[‘abc’, ‘country’, ‘hour’, ‘news’, ‘weather’, ‘grandstand’, ‘friday’, ‘nsw’, ‘drum’, ‘monday’],
[‘govt’, ‘election’, ‘council’, ‘government’, ‘minister’, ‘pm’, ‘parliament’, ‘nsw’, ‘anti’, ‘trump’],
[‘police’, ‘man’, ‘court’, ‘murder’, ‘charged’, ‘accused’, ‘death’, ‘guilty’, ‘charges’, ‘assault’],
[‘farmers’, ‘water’, ‘drought’, ‘industry’, ‘farm’, ‘coal’, ‘green’, ‘cattle’, ‘mining’, ‘nsw’],
[‘health’, ‘hospital’, ‘flu’, ‘mental’, ‘doctors’, ‘treatment’, ‘cancer’, ‘drug’, ‘service’, ‘care’]]


The overall coherence is the sum of the similarity scores of all possible pairs of words for each cluster, normalized by the no. of clusters (a sketch of this computation follows below).
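The word-embedding-based coherence described above can be sketched as follows; loading pre-trained W2V vectors through gensim's downloader (and the specific model name) is an assumption, and the topic word lists are truncated for brevity.

```python
# Sketch of the coherence metric above: for each topic, sum the cosine
# similarities of all pairs of its top words, then normalize by the number
# of topics. The pre-trained vector name below is an assumption.
from itertools import combinations
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")          # pre-trained W2V embeddings

def coherence(top_words_per_topic):
    total = 0.0
    for words in top_words_per_topic:
        words = [w for w in words if w in w2v]      # skip out-of-vocabulary words
        total += sum(w2v.similarity(a, b) for a, b in combinations(words, 2))
    return total / len(top_words_per_topic)         # normalize by no. of topics

topics = [["police", "baghdad", "war", "probe"],    # truncated illustrative lists
          ["win", "cup", "final", "world"]]
print(coherence(topics))
```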

Conclusion

  • We used two news corpora of varying text length. One dataset has ground-truth labels and the other does not.

  • In our comparison of PTW-LDA vs USE-Clustering

    • using both a ‘supervised’ evaluation metric (Adjusted Mutual Information Score) and an ‘unsupervised’ one (word-embedding-based coherence), USE-Clustering performs far better than PTW-LDA, despite repeated attempts with different sets of hyper-parameters for PTW-LDA

Important References:
USE:
- https://amitness.com/2020/06/universal-sentence-encoder/
- USE Paper: https://arxiv.org/abs/1803.11175

LDA:
- The original LDA paper by David Blei et al.
- https://www.slideshare.net/hustwj/nicolas-loeff-lda
- http://videolectures.net/mlss09uk_blei_tm/
- http://www.cs.columbia.edu/~blei/talks/Blei_ICML_2012.pdf
- https://www.quora.com/What-is-an-intuitive-explanation-of-the-Dirichlet-distribution (compares the normal distribution with the Dirichlet distribution)
- https://ldabook.com/lda-as-a-generative-model.html
- https://ldabook.com/index.html