:earth_americas: Topic Modeling in the Real World



At Nava, my team is working on an API for Medicare’s Quality Payment Program (QPP). Some of my superhuman teammates support developers integrating with our API through the QPP APIs Google Group.

I was searching for a tough nut to crack, and making sense of the QPP APIs Google Group discussions looked like a challenging and valuable project.

Plus, it would be an opportunity for topic modeling in real life!

LDA: A topic modeling methodology for text

I applied topic modeling to QPP APIs Google Group forum text with the goal of understanding what was being discussed in the 239 (and growing daily) forum threads.

Latent Dirichlet Allocation (LDA) is perhaps the most well-known algorithm for topic modeling. LDA is an unsupervised learning algorithm that discovers clusters of words that commonly appear together in the same “document”. These clusters of words are the topics the algorithm discovers.

What is a document? It depends on the corpus of text being analyzed. In the case of the QPP APIs Google Group, a document was the content of a single post, whether it started a thread or responded to one.

How does the algorithm work?

A machine learning algorithm can often be decomposed into the following stages:

1. Data retrieval and preparation

2. Variable setup or initialization (This step often involves generating some random variables.)

3. An update loop (This step often updates the random variables to be less random.)

LDA Step 1: Data retrieval and preparation
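
As a rough sketch, preparation can mean turning each post into a list of vocabulary indices. The example posts, the tiny stop-word list, and the helper name `tokenize` below are hypothetical stand-ins for illustration, not the code or data actually used on the forum text.

```python
import re

# Hypothetical stand-ins for forum posts; the real corpus is not shown here.
posts = [
    "Sandbox submissions endpoint returns an error",
    "Score value is null for the measure",
    "Thanks for the update on the sandbox error",
]

# A tiny illustrative stop-word list (an assumption, not the author's list).
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


tokenized = [tokenize(p) for p in posts]

# Map each distinct token to an integer id (sorted so ids are stable).
vocabulary = sorted({t for doc in tokenized for t in doc})
word_to_index = {w: i for i, w in enumerate(vocabulary)}

# Each document becomes a list of vocabulary indices.
document_word_indices = [[word_to_index[t] for t in doc] for doc in tokenized]
```

Working with integer indices rather than strings keeps the later counting steps simple: every token can be used directly as a position in a count table.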

LDA Step 2: Initialization

Define the following:
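
A common setup for collapsed Gibbs sampling, sketched here under assumptions, is to give every token a random topic and keep three count tables in sync with those assignments. The toy corpus and all variable names other than `document_word_indices` are hypothetical.

```python
import random

random.seed(0)

# Hypothetical toy corpus: each document is a list of vocabulary indices.
document_word_indices = [[0, 1, 2, 1], [2, 3, 3], [0, 3, 1]]
n_topics = 2  # number of topics, chosen by the analyst
n_words = 4   # vocabulary size

# Count tables used by the sampler.
doc_topic_counts = [[0] * n_topics for _ in document_word_indices]  # docs x topics
topic_word_counts = [[0] * n_words for _ in range(n_topics)]        # topics x words
topic_totals = [0] * n_topics                                       # tokens per topic

# Assign every token a random topic and record it in the counts.
assignments = []
for d, doc in enumerate(document_word_indices):
    doc_assignments = []
    for w in doc:
        z = random.randrange(n_topics)
        doc_assignments.append(z)
        doc_topic_counts[d][z] += 1
        topic_word_counts[z][w] += 1
        topic_totals[z] += 1
    assignments.append(doc_assignments)
```

These random assignments are the "random variables" of the initialization stage; the update loop then makes them progressively less random.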

LDA Step 3: Loop!

Re-assign words to topics via collapsed Gibbs sampling.

For i in iterations, d in document_word_indices and w in d do:
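
The loop can be sketched in plain Python as follows. This is a minimal illustration of the standard collapsed Gibbs update, not the hosted implementation: the function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and their default values are assumptions. For each token, the sampler removes its current topic from the counts, weights each topic by (document-topic count + alpha) × (topic-word count + beta) / (topic total + vocabulary size × beta), and draws a new topic from those weights.

```python
import random


def gibbs_lda(document_word_indices, n_words, n_topics,
              iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as word-index lists."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in document_word_indices]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Initialization: a random topic for every token.
    z = []
    for d, doc in enumerate(document_word_indices):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(document_word_indices):
            for n, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + n_words * beta)
                    for j in range(n_topics)
                ]

                # Draw a new topic and add the token back into the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Hypothetical toy corpus with a vocabulary of four words.
docs = [[0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 1], [3, 2, 2]]
doc_topic, topic_word = gibbs_lda(docs, n_words=4, n_topics=2)
```

After sampling, the largest entries in each row of `topic_word` indicate the tokens most associated with that topic, and each row of `doc_topic` shows how a document's tokens are spread across topics.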

The code

I’ve written code for the LDA algorithm using collapsed Gibbs sampling, hosted here. However, in practice, I recommend using sklearn’s LatentDirichletAllocation class.

Is LDA useful?

The general consensus is that, like a lot of data science methods, LDA raises questions more often than it delivers answers. Raising questions is useful, but we are often in search of deeper understanding, so new questions as an outcome can be frustrating. Welcome to data science.

As detailed below, I did find LDA helpful in summarizing the content of the QPP APIs Google Group, which is more than I can read in its entirety myself. However, it is not really a replacement for the content itself.

So what are people discussing in the QPP APIs Google Group?

I applied LDA to the QPP APIs Google Group posts and found the following discernible topics:

| Topic Summary | Tokens in Topic |
| --- | --- |
| The sandbox topic | submissions https qpp cms gov ap sandbox error navapbc endpoint |
| The scoring topic | value score eligiblepopulation decile null points performancenotmet true performancemet denominator |
| The measures data topic | measures data measure submit performance cmsgov github json master strata |
| The data submission topic | submission measurement sets set file request data using use submit |
| The feedback and communication topic | thanks update team like issue question reply measure feedback lot |
| The forum / "navahq.com" topic | com http navahq apis google groups forum qpp topic https |

