Topic Detection: What Is It All About?

A classic use case in text analytics is text summarization, that is, the art of extracting the most meaningful words from a text document to represent it.

For example, if we have to catalog the tragedy of “Romeo and Juliet” by Shakespeare with, say, only five keywords, then “love”, “death”, “young”, “gentlemen”, and “quarrel” might be sufficiently descriptive. Of course, the poetry is lost, but the main topics are preserved in this keyword-based representation of the text.

Text Summarization

To obtain this kind of text summarization, we can use any of the keyword extraction techniques available in the KNIME TextProcessing extension: the Keygraph Keyword Extractor, the Chisquare Keyword Extractor, or keyword extraction based on the TF*IDF frequency measure. For a detailed description of the algorithms behind these techniques, refer to chapter 4 in [1]. Another common way to summarize a text is to use the LDA (Latent Dirichlet Allocation) algorithm.
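To make the TF*IDF idea concrete outside of KNIME, here is a minimal Python sketch. It is not the implementation used by the KNIME nodes: the function names, the tokenization, and the plain `tf * log(N / df)` formula are assumptions chosen for illustration, and the three short "documents" are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, doc_index, top_k=5):
    """Return the top_k (term, score) pairs for one document of a collection."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    target = tokenized[doc_index]
    scores = {}
    for term, count in Counter(target).items():
        tf = count / len(target)                 # relative term frequency
        idf = math.log(n_docs / doc_freq[term])  # 0 for terms found in every document
        scores[term] = tf * idf

    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Invented toy collection, for illustration only.
docs = [
    "love and death in verona love conquers the feud",
    "the feud in verona ends with a quarrel",
    "the prince rules verona and ends the quarrel",
]
for term, score in tfidf_keywords(docs, doc_index=0, top_k=3):
    print(term, round(score, 3))
```

Terms that occur in every document, such as “verona” here, get a score of zero, which is why TF*IDF favors words that are frequent in one document but rare in the rest of the collection.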

The node implementing the LDA algorithm in the KNIME Analytics Platform is the “Topic Extractor (Parallel LDA)” node. For more details on the LDA algorithm and the corresponding node, check chapter 6 in [1].

If we apply the “Topic Extractor (Parallel LDA)” node to the “Romeo and Juliet” tragedy, searching for 3 topics each represented by 10 keywords, we find one dominant topic mainly described by “love”, “death”, “ladi”, “night”, and “thou”, and two minor topics described respectively by “gentlemen”, “pretty”, “ladi”, “quarrel” and by “die”, “youth”, “villain”, “slaughter”. (The keywords appear in stemmed form, hence “ladi”.) A topic’s importance is quantified by its keywords’ weights. If we display the topics’ keywords in a word cloud and scale the word size by keyword weight, we easily see that topic 0, in red, is the dominant one (Figure 2).
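The two ideas in this paragraph, ranking topics by keyword weight and scaling word size by weight, can be sketched in a few lines of Python. This is only an illustration: the weights below are invented (they are not the values produced by the node), and both the choice of summing keyword weights to rank topics and the linear font scaling are assumptions.

```python
def rank_topics(topics):
    """Order topic ids by the total weight of their keywords, largest first."""
    return sorted(topics, key=lambda t: sum(topics[t].values()), reverse=True)


def font_sizes(keyword_weights, min_size=10, max_size=48):
    """Linearly map keyword weights to font sizes in points."""
    lo = min(keyword_weights.values())
    hi = max(keyword_weights.values())
    span = hi - lo
    sizes = {}
    for word, weight in keyword_weights.items():
        # With identical weights, every word gets the middle size.
        fraction = (weight - lo) / span if span else 0.5
        sizes[word] = round(min_size + fraction * (max_size - min_size))
    return sizes


# Hypothetical keyword weights per topic (made-up numbers, for illustration).
topics = {
    0: {"love": 120, "death": 90, "night": 60},
    1: {"gentlemen": 30, "quarrel": 20},
    2: {"youth": 25, "villain": 15},
}
print(rank_topics(topics))    # [0, 1, 2]
print(font_sizes(topics[0]))  # {'love': 48, 'death': 29, 'night': 10}
```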

Figure 1. Summarizing the “Romeo and Juliet” tragedy with 10 keywords x 3 topics from a “Topic Extractor (Parallel LDA)” node.

Note: The Tika Parser node, used here to read the EPUB file of the “Romeo and Juliet” tragedy, is a very versatile node that can read text documents in a number of different formats: from .epub to .pdf, from .pptx to .docx, and many more.

Figure 2. The resulting 3 topics – in green, blue, and red – summarizing the “Romeo and Juliet” tragedy. Keyword weight defines keyword size in the word cloud. Note the dominant red topic of “love” and “death”.

Topic Detection

The LDA algorithm actually performs more than just text summarization; it also discovers recurring topics in a document collection. In this case, we talk about topic detection. The LDA algorithm extracts a set of keywords from each text document. Documents are then clustered together to find the keywords that recur across groups of documents. Each such set of recurring keywords can then be considered a topic covering a number of documents in the collection.

Where Is Topic Detection Used?

Let’s suppose we have a series of reviews for a product, a restaurant, or a tourist destination. By extracting a few topics common to groups of reviews, we can discover the features of the product, restaurant, or destination that have impressed the reviewers the most.

Let’s suppose we have a disorganized set of pictures with descriptions. We could reorganize them based on the topics associated with their descriptions.

Let’s suppose we have a number of newspapers, each reporting on a set of news stories; detecting the common topics helps identify each paper’s orientation and the trend of the day.

You get the point. I am sure you can come up with a number of similar use cases involving topic detection.

What You Need

To implement a topic detection application via the LDA algorithm, you first need a collection of text documents. LDA is unsupervised, like a clustering technique, so the documents do not need to be labeled. For this example, we have used a collection of 190 news articles from various newspapers, stored in the file news.table. The goal is to label each news article with the corresponding topic.

Workflow

The central piece of a workflow implementing topic detection is the LDA node. After preparing the data, we feed the text documents into the LDA node. We then set the LDA node to report n topics (for example 7) and to describe each of these topics with m words (for example 10). Again, a word cloud of the topic keywords, with size proportional to the keyword weights, has been built and can be seen in Figure 3.

Note: Here, a simple document-based representation is sufficient. Word/term extraction and text vectorization are not necessary, since the “Topic Extractor (Parallel LDA)” node performs all such operations internally.
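For readers who want to see what “n topics described by m words” means computationally, here is a minimal pure-Python sketch of LDA fitted by collapsed Gibbs sampling. It is not the implementation behind the KNIME node. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the use of raw token counts as keyword weights, and the toy documents are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_words, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (topics, doc_topic): topics is a list of n_topics lists of
    (word, count) pairs, at most n_words each; doc_topic[d][k] is the number
    of tokens of document d currently assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    topics = [
        [(w, c) for w, c in topic_word[t].most_common(n_words) if c > 0]
        for t in range(n_topics)
    ]
    return topics, doc_topic


# Invented toy documents, already tokenized.
toy_docs = [
    "grill ribs sauce smoke grill".split(),
    "match goal team score match".split(),
    "sauce ribs grill recipe smoke".split(),
    "team score goal league match".split(),
]
toy_topics, toy_doc_topic = lda_gibbs(toy_docs, n_topics=2, n_words=3)
for topic_id, keywords in enumerate(toy_topics):
    print(topic_id, keywords)
```

The sampler is seeded, so repeated runs give the same result, but which topic index ends up with which theme is arbitrary. On a real collection you would also remove stop words and stem the terms before sampling.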

Figure 3. Word cloud of 10 keywords x 7 topics in the news dataset. The news articles all seem to relate to sports and BBQ.

Figure 4. Workflow extracting 7 topics – each one described by 10 keywords – on the news dataset with the “Topic Extractor (Parallel LDA)” node.

Two topics clearly emerge from the word cloud: one (in purple) about BBQ recipes and one (in green) about reading sports news, followed by other topics concerning sports and grilling. This is evidently a news collection about outdoor cooking and sports watching.

In the first output table of the LDA node, each news article has been assigned to one of the discovered topics. The news articles are then grouped into 7 topic groups, as was the goal of this small project.
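The assignment and grouping step can be sketched as follows. This is an illustration, not the node's internals: it assumes each document comes with one weight per topic and is assigned to the topic with the highest weight. The article ids and weights are invented.

```python
def assign_dominant_topic(doc_topic_weights):
    """Map each document id to the index of its highest-weight topic."""
    assignment = {}
    for doc_id, weights in doc_topic_weights.items():
        assignment[doc_id] = max(range(len(weights)), key=weights.__getitem__)
    return assignment


def group_by_topic(assignment):
    """Invert a document-to-topic mapping into topic-to-documents groups."""
    groups = {}
    for doc_id, topic in assignment.items():
        groups.setdefault(topic, []).append(doc_id)
    return groups


# Hypothetical per-document topic weights (3 topics, made-up numbers).
weights = {
    "article_a": [0.7, 0.2, 0.1],
    "article_b": [0.1, 0.1, 0.8],
    "article_c": [0.6, 0.3, 0.1],
}
assignment = assign_dominant_topic(weights)
print(group_by_topic(assignment))
# {0: ['article_a', 'article_c'], 2: ['article_b']}
```

A topic with no assigned documents, such as topic 1 here, simply does not appear among the groups.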

Deployment

There is no deployment for this use case, as there is no LDA Apply node. Indeed, the LDA node works only on one set of text documents. Calculating the distance between the keyword-based representation of a new document and that of the detected topics would be really hard, if not impossible, mainly because of the reduced topic keyword dictionary: it is very unlikely that a new document will contain many, if any, of the few keywords used to describe the topics. That is why deployment, in this case, is the same as detecting the topics on a text collection.

In this use case, we have shown two workflows: one for summarizing a single text and one for more general topic detection on a text collection. We have also explained how the two use cases differ in goals and techniques.

Rosaria Silipo

Rosaria Silipo, Ph.D., principal data scientist at KNIME, is the author of 50+ technical publications, including her most recent book "Practicing Data Science: A Collection of Case Studies". She holds a doctorate degree in bio-engineering and has spent most of her professional life working on data science projects for companies in a broad range of fields, including IoT, customer intelligence, the financial industry, and cybersecurity.
