Topic Modeling in the Real World
At Nava, my team is working on an API for Medicare’s Quality Payment Program (QPP). Some of my superhuman teammates support developers integrating with our API through the QPP APIs Google Group.
Searching for a tough nut to crack, I felt that developing an understanding of the QPP APIs Google Group discussions would be a challenging and valuable project.
Plus - it would be an opportunity for topic modeling in real life!
LDA: A topic modeling methodology for text
I applied topic modeling to QPP APIs Google Group forum text with the goal of understanding what was being discussed in the 239 (and growing daily) forum threads.
Latent Dirichlet Allocation (LDA) is perhaps the most well-known algorithm for topic modeling. LDA is an unsupervised learning algorithm that discovers clusters of words which commonly appear together - in the same “document”. Clusters of these words comprise the topics discovered by the algorithm.
What is a document? It depends on the corpus of text being analyzed. In the case of the QPP APIs Google Group, a document was the content of a single post, whether an original topic or a response.
How does the algorithm work?
A machine learning algorithm can often be decomposed into the following stages:
1. Data retrieval and preparation
2. Variable setup or initialization (This step often involves generating some random variables.)
3. An update loop (This step often updates the random variables to be less random.)
LDA Step 1: Data retrieval and preparation
- Retrieve a corpus (or collection) of documents. Tokenize and otherwise process the documents to achieve the most meaningful results. For example, this step may involve removing numeric characters.
- Generate a bag-of-words matrix. The matrix dimensions are `D x W`, where `D` is the number of documents and `W` is the number of unique tokens. This matrix also serves to generate a vocabulary index for each unique token in the corpus.
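A minimal sketch of this preparation step using scikit-learn’s CountVectorizer; the `posts` list and the `preprocess` helper are illustrative placeholders rather than the actual QPP forum pipeline:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-ins for the scraped Google Group posts.
posts = [
    "Submitting measures data to the sandbox returned a 404 error",
    "How is the decile score calculated for this measure?",
]

def preprocess(text):
    """Lowercase the text and strip numeric characters before tokenization."""
    return re.sub(r"\d+", "", text.lower())

vectorizer = CountVectorizer(preprocessor=preprocess, stop_words="english")
bag_of_words = vectorizer.fit_transform(posts)   # D x W sparse count matrix
vocabulary = vectorizer.get_feature_names_out()  # vocabulary index -> token

print(bag_of_words.shape)  # (number of documents, number of unique tokens)
```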
LDA Step 2: Initialization
Define the following:
- Hyperparameters
  - `k` - A user-defined number of topics. A magic hyperparameter; in theory, it represents the number of topics underlying the corpus’s generative process.
  - `alpha` and `eta` - Dirichlet prior hyperparameters that smooth the document-topic and topic-word distributions, respectively.
- Data structures
  - `document_vocabulary_indices`: A D-length array with variable-length inner arrays. Each inner array is a list of integers representing the vocabulary index for each token in the document.
  - `topic_assignments`: A D-length array with variable-length inner arrays. Each inner array is a list of integers representing each document token’s topic assignment.
  - `document_topic_counts`: A D x K matrix holding counts of tokens in document d assigned to topic k.
  - `word_topic_counts`: A W x K matrix holding counts of token w assigned to topic k.
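A minimal sketch of the initialization, assuming the `bag_of_words` matrix from Step 1; the values of `k`, `alpha`, and `eta` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 6        # user-defined number of topics
alpha = 0.1  # document-topic Dirichlet prior
eta = 0.01   # topic-word Dirichlet prior

D, W = bag_of_words.shape

# Expand each row of the D x W matrix into a list of vocabulary indices,
# repeating a token as many times as it occurs. (Token order is lost, but
# LDA's bag-of-words assumption makes order irrelevant.)
document_vocabulary_indices = []
for row in bag_of_words:
    doc = []
    for w, count in zip(row.indices, row.data):
        doc.extend([w] * count)
    document_vocabulary_indices.append(doc)

# Randomly assign every token to a topic, then tally the counts.
topic_assignments = [rng.integers(k, size=len(doc)) for doc in document_vocabulary_indices]
document_topic_counts = np.zeros((D, k), dtype=int)
word_topic_counts = np.zeros((W, k), dtype=int)

for d, doc in enumerate(document_vocabulary_indices):
    for w_i, z in zip(doc, topic_assignments[d]):
        document_topic_counts[d, z] += 1
        word_topic_counts[w_i, z] += 1
```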
LDA Step 3: Loop!
Re-assign words to topics via collapsed Gibbs sampling.
For `i` in iterations, `d` in `document_vocabulary_indices`, and each token `w` (at position `token_index`) in `d`, do:
- `w_i` - the vocabulary index for token `w` in `d`
- `current_topic_assignment = topic_assignments[d][token_index]`
- Decrement the count for `d, current_topic_assignment` in `document_topic_counts`
- Decrement the count for `w_i, current_topic_assignment` in `word_topic_counts`
- Calculate `a = p(topic t | document d)`: the proportion of words in document `d` that are assigned to topic `t`
- Calculate `b = p(word w | topic t)`: the proportion of assignments to topic `t`, over all documents `d`, that come from word `w`
- Calculate the K probabilities that `w` belongs to each topic `k`: `p_z = b * a`
- Sample `new_topic_assignment` from a multinomial with the K probabilities `p_z / sum(p_z)`
- Re-assign: `topic_assignments[d][token_index] = new_topic_assignment`
- Increment the counts for `new_topic_assignment` in `document_topic_counts` and `word_topic_counts`
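A minimal sketch of that loop in code, continuing from the initialization sketch above (illustrative, not the exact implementation linked below):

```python
n_iterations = 500  # illustrative iteration count

for _ in range(n_iterations):
    for d, doc in enumerate(document_vocabulary_indices):
        for token_index, w_i in enumerate(doc):
            current = topic_assignments[d][token_index]

            # Remove the token's current assignment from the counts.
            document_topic_counts[d, current] -= 1
            word_topic_counts[w_i, current] -= 1

            # a = p(topic t | document d): smoothed proportion of tokens in
            # document d assigned to each topic.
            a = (document_topic_counts[d] + alpha) / (len(doc) - 1 + k * alpha)

            # b = p(word w | topic t): smoothed proportion of assignments to
            # each topic, over all documents, that come from word w_i.
            b = (word_topic_counts[w_i] + eta) / (word_topic_counts.sum(axis=0) + W * eta)

            # Sample a new topic in proportion to p_z = b * a.
            p_z = a * b
            new_topic_assignment = rng.choice(k, p=p_z / p_z.sum())

            # Re-assign and restore the counts.
            topic_assignments[d][token_index] = new_topic_assignment
            document_topic_counts[d, new_topic_assignment] += 1
            word_topic_counts[w_i, new_topic_assignment] += 1
```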
The code
I’ve written code for the LDA algorithm using collapsed Gibbs sampling, hosted here. However, in practice, I recommend using sklearn’s LatentDirichletAllocation class.
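For example, a short sketch of the scikit-learn route, reusing the `bag_of_words` matrix and `vocabulary` from Step 1 (note that scikit-learn fits LDA with online variational Bayes rather than collapsed Gibbs sampling, but the discovered topics are interpreted the same way; the parameter values are illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation

# n_components mirrors the six topics summarized below.
lda = LatentDirichletAllocation(n_components=6, random_state=0)
lda.fit(bag_of_words)  # the D x W matrix from Step 1

# Print the ten highest-weighted tokens for each discovered topic.
for topic_idx, weights in enumerate(lda.components_):
    top_tokens = [vocabulary[i] for i in weights.argsort()[::-1][:10]]
    print(f"Topic {topic_idx}: {' '.join(top_tokens)}")
```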
Is LDA useful?
The general consensus is that, like a lot of data science methods, LDA raises questions more often than it delivers answers. Raising questions is useful, but we are often in search of deeper understanding, so new questions as an outcome can be frustrating. Welcome to data science.
As detailed below, I did find LDA helpful in summarizing the QPP APIs Google Groups forums’ content - content I cannot myself read in its entirety. However, it is not really a replacement for the content itself.
So what are people discussing in the QPP APIs Google Groups?
I applied LDA to the QPP APIs Google Groups posts and found the following discernible topics:
| Topic Summary | Tokens in Topic |
| --- | --- |
| The sandbox topic | submissions https qpp cms gov ap sandbox error navapbc endpoint |
| The scoring topic | value score eligiblepopulation decile null points performancenotmet true performancemet denominator |
| The measures data topic | measures data measure submit performance cmsgov github json master strata |
| The data submission topic | submission measurement sets set file request data using use submit |
| The feedback and communication topic | thanks update team like issue question reply measure feedback lot |
| The forum / "navahq.com" topic | com http navahq apis google groups forum qpp topic https |