Topic Extraction: Old and New

How do you find thematic clusters in a large corpus of text documents? There are the standard algorithms baked into sklearn: k-means, nonnegative matrix factorization, and LDA. But contemporary NLP has largely moved on from bag-of-words representations. Can I get better results with some pretrained transformer models? In this notebook, I’ll be playing around with topic extraction: first with some small language models from Hugging Face, then with a larger one accessed through langchain.

import numpy as np
from typing import Optional
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD, NMF, LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from transformers import AutoTokenizer, BartForConditionalGeneration, AutoModel
from sklearn.metrics import silhouette_samples, silhouette_score
import torch
from torch.utils.data import DataLoader
import torch.nn.functional as F
from langchain.chat_models import init_chat_model
from langchain_core.messages import HumanMessage, SystemMessage
from pydantic import BaseModel, Field

Fetching the Data

For demonstration purposes, I’ll use a few categories from the standard 20-newsgroups dataset.

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"),
    subset="all",
    categories=categories,
    shuffle=True,
    random_state=42,
)

Some of the documents in the dataset are only a few words long; I only want to deal with documents that are at least a couple hundred characters.

filtered_text = filter(lambda x: len(x) > 200, (d.strip() for d in dataset.data))
X = list(filtered_text)

It will be convenient to use numpy-style indexing later on.

np_text = np.array(X)

LDA

As a baseline, we can use the standard Latent Dirichlet Allocation model. Each topic has its own distribution over words. Each document has its own distribution over topics. We observe documents as bags of words and infer both sets of distributions; sklearn does this with approximate variational Bayes inference rather than exact maximum likelihood estimation.
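
To make the generative story concrete, here is a minimal sketch in plain Python that samples a toy corpus the way LDA assumes documents arise. The vocabulary, the concentration value of 0.5, and the helper names `sample_dirichlet` and `generate_corpus` are all illustrative assumptions; none of this is part of sklearn or of the pipeline in this notebook.

```python
import random

def sample_dirichlet(rng, alpha, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics=2, n_docs=3, doc_len=8, seed=0):
    """Sample documents from the LDA generative process (toy settings)."""
    rng = random.Random(seed)
    # Each topic has its own distribution over words.
    topic_word = [sample_dirichlet(rng, 0.5, len(vocab)) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document has its own distribution over topics.
        doc_topic = sample_dirichlet(rng, 0.5, n_topics)
        words = []
        for _ in range(doc_len):
            # Pick a topic for this word slot, then a word from that topic.
            z = rng.choices(range(n_topics), weights=doc_topic)[0]
            words.append(rng.choices(vocab, weights=topic_word[z])[0])
        docs.append(words)
    return topic_word, docs

toy_vocab = ["orbit", "launch", "nasa", "pixel", "jpeg", "render"]
toy_topics, toy_docs = generate_corpus(toy_vocab)
```

Fitting LDA runs this story in reverse: given only the bags of words, it estimates the topic-word and document-topic distributions that plausibly produced them.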

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
tf = tf_vectorizer.fit_transform(X)
lda = LatentDirichletAllocation(
    n_components=8,
    max_iter=10,
    learning_method="online",
    learning_offset=50.0
)
lda.fit(tf)
LatentDirichletAllocation(learning_method='online', learning_offset=50.0,
                          n_components=8)
tf_feature_names = tf_vectorizer.get_feature_names_out()
def top_components(terms, m, n=5, k=5):
    ixs = np.argsort(-np.abs(m.components_[:n]), axis=-1)
    return [[(float(a), b) for (a,b) in
          zip(np.round(m.components_[i, ix[:k]], decimals=2), terms[ixs[i, :k]])]
          for (i, ix) in enumerate(ixs)]

The first five of the eight topics include space and graphics, as we’d expect. But we’re not seeing much in the way of atheism or religion.

top_components(tf_feature_names, lda)
[[(62.95, 'radius'),
  (29.93, 'int'),
  (26.16, 'tom'),
  (23.52, 'row'),
  (23.14, 'sphere')],
 [(1079.27, 'space'),
  (385.38, 'earth'),
  (383.02, 'launch'),
  (304.52, 'orbit'),
  (304.04, 'nasa')],
 [(1056.76, 'people'),
  (999.99, 'don'),
  (924.22, 'just'),
  (888.57, 'think'),
  (771.1, 'like')],
 [(801.38, 'jpeg'),
  (596.24, 'image'),
  (463.1, 'file'),
  (428.53, 'gif'),
  (323.77, 'images')],
 [(826.6, 'edu'),
  (728.09, 'graphics'),
  (561.19, 'image'),
  (546.23, 'data'),
  (448.34, 'available')]]

NMF

Another off-the-shelf approach is to use nonnegative matrix factorization of tf-idf features.
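
For readers who want to see what the tf-idf features look like, here is a small sketch in plain Python. It assumes the weighting I understand to be sklearn's default (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2-normalized rows) and it tokenizes by whitespace, which is a simplification of what TfidfVectorizer actually does. The function name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows) where each row is an L2-normalized tf-idf vector."""
    n = len(docs)
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted({w for c in counts for w in c})
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for c in counts if w in c) for w in vocab}
    # Smoothed inverse document frequency; rarer words get larger weights.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for c in counts:
        row = [c[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

toy_docs_tfidf = ["space launch orbit", "space image", "image jpeg image"]
toy_terms, toy_rows = tfidf_rows(toy_docs_tfidf)
```

In the first toy document, "launch" appears in fewer documents than "space", so it ends up with the larger weight even though both occur once.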

tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
X_tfidf = tfidf_vectorizer.fit_transform(X)
tfidf_terms = tfidf_vectorizer.get_feature_names_out()
nmf = NMF(n_components=4, alpha_W=0.00005, alpha_H=0.00005, l1_ratio=1).fit(X_tfidf)

Here, we see strong evidence for topics about Christianity, space and images. It’s a little vague, though. For example, we don’t differentiate atheism and theism, and we don’t see anything about computers in the graphics topic.

top_components(tfidf_terms, nmf)
[[(9.43, 'god'),
  (0.53, 'jesus'),
  (0.5, 'bible'),
  (0.43, 'believe'),
  (0.0, '00')],
 [(6.41, 'image'),
  (0.5, 'images'),
  (0.0, '00'),
  (0.0, 'pbmplus'),
  (0.0, 'pbm')],
 [(8.98, 'space'),
  (0.56, 'nasa'),
  (0.0, '00'),
  (0.0, 'pbmplus'),
  (0.0, 'pbm')],
 [(0.9, 'ico'),
  (0.88, 'bobbe'),
  (0.88, 'bronx'),
  (0.87, 'queens'),
  (0.87, 'beauchaine')]]

KMeans via SVD

Rounding out our classical approaches is the venerable k-means algorithm. We’ll use tf-idf features with dimensionality reduced via SVD.
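
One reason to L2-normalize the reduced vectors before clustering: for unit vectors, squared Euclidean distance is a simple function of cosine similarity, so k-means on normalized vectors behaves roughly like clustering by cosine. The match is only approximate, because the centroids themselves are not unit length. The following plain-Python sketch checks the identity on two made-up vectors; the helper names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    # For unit vectors, the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([1.0, 0.0, 2.0])

# Identity for unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
lhs = squared_distance(u, v)
rhs = 2.0 - 2.0 * dot(u, v)
```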

lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))
X_lsa = lsa.fit_transform(X_tfidf)
lsa[0].explained_variance_ratio_.sum()
np.float64(0.19646096498676613)
kmeans = KMeans(n_clusters=4, n_init=20)
kmeans.fit(X_lsa)
KMeans(n_clusters=4, n_init=20)
original_space_centroids = lsa[0].inverse_transform(kmeans.cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]

This gives pretty good results: we get Christianity, computer graphics, and space! We just can’t tell atheism apart from Christianity.

tfidf_terms[order_centroids[:, :5]]
array([['space', 'launch', 'orbit', 'nasa', 'earth'],
       ['graphics', 'thanks', 'image', 'files', 'software'],
       ['god', 'jesus', 'bible', 'believe', 'christian'],
       ['people', 'think', 'don', 'just', 'like']], dtype=object)

Summarization with Transformers

Instead of describing topics as multinomial distributions over words, let’s see if we can get a transformer model to generate human-readable summaries instead!

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").to("mps")
bart_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

The examples in the Hugging Face documentation encourage you to use the bart-large-cnn model without a decoder prompt, but I find the summaries are better with a little prompting. Browsing the Hugging Face codebase shows an undocumented decoder_input_ids option for precisely this purpose.

gtoks = bart_tokenizer("The major themes in this documents are ", truncation=True, return_tensors="pt", max_length=1024)['input_ids'].to('mps')
def summarize(strs):
    toks = bart_tokenizer(strs, padding="longest", truncation=True, return_tensors="pt", max_length=1024)
    expanded = gtoks.expand(toks['input_ids'].shape[0], -1)
    return bart_tokenizer.batch_decode(
        bart.generate(**toks.to("mps"), decoder_input_ids=expanded), skip_special_tokens=True)

Summarized Central Texts

We can start by just summarizing the documents nearest to the center of each cluster.

def most_central(X, kmeans):
    return [np.argmin(((X[kmeans.labels_ == i] - c)**2).sum(axis=-1))
        for i, c in enumerate(kmeans.cluster_centers_)]
centers = most_central(X_lsa, kmeans)
central_texts = [str(np_text[kmeans.labels_ == i][a]) for i, a in enumerate(centers)]
summarize(central_texts)
['The major themes in this documents are atation, space exploration, and life sciences. The author argues that SSF needs to be redesigned to focus on the 3 main functions: microgravity/vacuum process research, life sciences research (adaptation to space) and spacecraft maintenence.',
 'The major themes in this documents are i graphics, raster graphics, and vector graphics. The most recent version of this FAQ is always available on the archive site pit-manager.mit.edu. If you have answers to other frequently asked questions that you would like included in this posting, please send me mail.',
 "The major themes in this documents are a, the love of God, and the power of God's word. The author of the book says he thinks Brian K. has made up his own god and is trying to pass him off as the real thing. He says Brian could be a St. Paul, who mocked Christians as you do, but also had pleasurestoning them.",
 'The major themes in this documents are algorithms, consciousness, and the mind. The author of this article has been asked to explain his view on the nature of a conscious mind. He has responded to a request for his views on the subject. He says that he does not think our brains work like computers, at all.']

Not bad, but pretty document-specific.

Multi Document Summarization using LogitsProcessor

Perhaps relying on a single representative document per class is too restrictive. We can let the probability of generating a token given a set of context vectors be the product of the token’s probability in each context. The intuition is that a token is a good choice for describing a cluster if it’s a good choice for describing each document within the cluster individually. This is easy to accomplish with a little abuse of Hugging Face’s logits_processor argument.
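
To make the product rule concrete, here is a toy sketch in plain Python with two contexts and a three-token vocabulary. Multiplying probabilities is the same as summing log-probabilities, which is why the rule reduces to a sum over the batch dimension. The logit values and helper names are invented for illustration, and the final renormalization is optional when all you need is the argmax.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def product_of_contexts(per_context_logits):
    """Combine contexts by multiplying token probabilities (summing log-probs)."""
    per_context = [log_softmax(row) for row in per_context_logits]
    summed = [sum(col) for col in zip(*per_context)]
    return log_softmax(summed)  # renormalize so the result is a distribution

toy_logits = [
    [2.0, 1.0, 0.0],   # context A: prefers token 0, then token 1
    [2.0, -3.0, 1.0],  # context B: prefers token 0, strongly dislikes token 1
]
combined = product_of_contexts(toy_logits)
best_token = max(range(len(combined)), key=combined.__getitem__)
```

Note how token 1, which context A ranks second, falls to last place in the combined distribution: under a product, a single context that assigns a token low probability acts as a veto.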

def multiply_logits(input_ids, scores):
    return scores.log_softmax(dim=-1).sum(dim=0, keepdim=True)
def summarize_group(strs, processor=multiply_logits):
    toks = bart_tokenizer(strs, padding="longest", truncation=True, return_tensors="pt", max_length=1024)
    expanded = gtoks.expand(toks['input_ids'].shape[0], -1)
    output = bart.generate(**toks.to("mps"), decoder_input_ids=expanded, num_beams=1, logits_processor=[processor])
    return bart_tokenizer.batch_decode(output[0][None], skip_special_tokens=True)
def top_per_cluster(kmeans, X, k=16):
    return [np_text[kmeans.labels_ == i][np.argsort(((X[kmeans.labels_ == i] - c)**2).sum(axis=-1))[:k]]
        for i, c in enumerate(kmeans.cluster_centers_)]

Alas, this doesn’t seem to work particularly well.

top_lsa = top_per_cluster(kmeans, X_lsa)
[summarize_group([str(a) for a in c])[0] for c in top_lsa]
['The major themes in this documents are -based, non-profit, non government, nonprofit, and private. The main themes are: space, space exploration, and space exploration. The first section of the document is titled "The Space-Based Space-based Space- Based Space-Boat" The second section is titled, "The space-based space-boat space-carrier"',
 'The major themes in this documents are i, p, px, and px-p. The main theme is the use of the "P" key to enter a word. The "P-word" is the key to the word "pix"',
 'The major themes in this documents are i, the God-given right to be a god, and the God of the Bible. The author also discusses the role of the church in the world. The book is published by Oxford University Press. The website is free to read.',
 'The major themes in this documents are i, the "world of the gods" and the "God of the universe" The "world" of the god of the devil is the "World of the God of the Universe" The world of theGod oftheWorld is the world of God of all things.']

Retrieval augmented generation does something similar, marginalizing each generated token’s distribution over all possible context documents. Perhaps the same approach could work here?

def marginalize_logits(input_ids, scores):
    return scores.softmax(dim=-1).sum(dim=0, keepdim=True).log()
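
The marginalization above can be checked on toy numbers. The sketch below, in plain Python, treats every context as equally likely (an assumption; retrieval-augmented models typically weight contexts by a retrieval score) and averages token probabilities in log space with a log-sum-exp, which avoids the underflow that taking softmax, summing, and then taking the log can hit for very unlikely tokens. PyTorch offers torch.logsumexp for the same purpose; the helper names and logit values here are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log-space values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_of_contexts(per_context_logits):
    """Average token probabilities across contexts, working in log space."""
    per_context = [log_softmax(row) for row in per_context_logits]
    n = len(per_context)
    # Uniform mixture: log((1/n) * sum_c p_c(token)) for each token.
    return [logsumexp(col) - math.log(n) for col in zip(*per_context)]

# Token 1 is so unlikely in both contexts that exp() of its log-prob underflows.
extreme_logits = [
    [0.0, -800.0],
    [0.0, -900.0],
]
mixed = mixture_of_contexts(extreme_logits)
```

Unlike a product, a mixture lets any single context vote a token in, so it tends to be more permissive about tokens that only fit one document.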

Nope: results are similarly bad.

[summarize_group([str(a) for a in c], processor=marginalize_logits)[0] for c in top_lsa]
['The major themes in this documents are atmosphere, space, and the universe. The list of topics is broken down into three categories: space, space exploration, and space exploration. The first section of the report is a list of the topics that have been discussed in the past.',
 'The major themes in this documents are i, the PC, and the PC-U. The project is called "The Graphics Resource List" The project was started in 1993 and is still going on today. The site is still open. The website is still live.',
 "The major themes in this documents are i's belief in God and the Bible. The author also believes that God is not a God who can be easily defied. The book is published by Piatkus, a publisher of Christian books. The publisher's website is www.piatkus.com.",
 'The major themes in this documents are -ism, theory of the mind, and the concept of God. The author also discusses the role of religion in the culture of the U.S. and the role that religion plays in the U-S. relationship.']

Neural Embeddings

We can also use transformers to produce the document embeddings that k-means clusters. The following model is a small BERT-style encoder (MiniLM) fine-tuned for generating sentence embeddings.

minilm_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
minilm = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2').to('mps')
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
loader = DataLoader(X, batch_size=16)
embeddings = []
for batch in loader:
    toks = minilm_tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = minilm(**toks.to('mps'))
        result = F.normalize(mean_pooling(model_output, toks['attention_mask']), p=2, dim=1)
        embeddings.append(result.cpu())
embeddings = torch.cat(embeddings)
torch.save(embeddings, "embeddings.pt")
neural_kmeans = KMeans(n_clusters=4, n_init=20)
neural_kmeans.fit(embeddings)
KMeans(n_clusters=4, n_init=20)

Once again, we can look at the descriptions of the most central documents.

centers = most_central(embeddings, neural_kmeans)
central_texts = [str(np_text[neural_kmeans.labels_ == i][a]) for i, a in enumerate(centers)]
top_embeddings = top_per_cluster(neural_kmeans, embeddings, k=4)
summarize(central_texts)
['The major themes in this documents are atation, space exploration, and life sciences. The author argues that SSF needs to be redesigned to focus on the 3 main functions: microgravity/vacuum process research, life sciences research (adaptation to space) and spacecraft maintenence.',
 'The major themes in this documents are iative, ethical, and moral. The main theme is that there are some actions wrong for all humans in all societies. The author argues that the idea that morality is subjective does not mean that we have to be objective about it.',
 'The major themes in this documents are xionist, anti-atheist, and anti-Mormon. The article was written by Rick Anderson, a member of the Church of Jesus Christ of Latter-day Saints. Anderson: "Of all the "preachers" of "truth" on this net, you have struck me as a self-righteous member of a wrecking crew"',
 'The major themes in this documents are i graphics. This program can let you READ, WRITE and DISPLAY images with different formats. It also let you do some special effects(ROTATION, DITHERING ....) on image. There is no warranty. The author is not responsible for any damage caused by this program.']

Space, ethics, religion and graphics. Not bad. But larger transformers can do a lot better.

Llama Summaries

Time to break out the big guns. We’ll ask Llama to analyze each cluster of texts and then decide on a title summarizing them.

llama = init_chat_model("llama-3.3-70b-versatile", model_provider="groq")
class SampleAnalysis(BaseModel):
    analysis: str = Field(description='Analysis of the texts.')
    title: str = Field(description='Title of the cluster.')
def llama_summarize(strs):
    prompt = [SystemMessage("""
Your task is to understand why the given documents were assigned to the same cluster.
- First analyze the documents in the cluster for common topics.
- Then, propose a title for the cluster containing these documents based on the analysis.""")]
    prompt.extend([f"# Document {i}\n{s}" for i,s in enumerate(strs)])
    return llama.with_structured_output(SampleAnalysis).invoke(prompt)
results = [llama_summarize([a[:5000] for a in t[:4]]) for t in top_embeddings]
[r.title for r in results]
['Rethinking Space Exploration and Development',
 'Morality and Societal Mandates',
 'Religious Debates and Critiques',
 'Graphics and Image Processing in MS-DOS']

That’s more like it!

This notebook will continue to grow as I try other approaches to the problem.
