Topic Detection: What Is It All About?


Rosaria Silipo

October 22, 2019

A classic use case in text analytics is text summarization, that is, the art of extracting the most meaningful words from a text document to represent it.

For example, if we have to catalog Shakespeare’s tragedy “Romeo and Juliet” with, let’s say, only 5 keywords, then “love”, “death”, “young”, “gentlemen”, and “quarrel” might be sufficiently descriptive. Of course, the poetry is lost, but the main topics are preserved in this keyword-based representation of the text.

Text Summarization

To obtain this kind of text summarization, we can use any of the keyword extraction techniques available in the KNIME TextProcessing extension: the Keygraph Keyword Extractor, the Chisquare Keyword Extractor, or keyword extraction based on the TF*IDF frequency measure. For a detailed description of the algorithms behind these techniques, refer to chapter 4 in [1]. Another common way to summarize a text is to use the LDA (Latent Dirichlet Allocation) algorithm.
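To make the TF*IDF idea concrete, here is a minimal sketch in plain Python. It is not the KNIME implementation: the function names, the tokenizer, the particular tf and idf formulas (relative term frequency times log(N / document frequency)), and the toy corpus are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_keywords(documents, doc_index, top_k=5):
    """Return the top_k (term, score) pairs for one document.

    tf  = term count in the document / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    tokens = tokenized[doc_index]
    counts = Counter(tokens)
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy corpus, invented for illustration only.
docs = [
    "love and death and love in verona",
    "the quarrel of the gentlemen in verona",
    "the night and the youth",
]
keywords = tfidf_keywords(docs, doc_index=0, top_k=2)
print([term for term, _ in keywords])  # ['love', 'death']
```

Terms that are frequent in one document but rare in the rest of the collection get the highest scores, which is why “love” and “death” outrank “and” or “in” in this toy example.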

The node implementing the LDA algorithm in the KNIME Analytics Platform is the “Topic Extractor (Parallel LDA)” node. For more details on the LDA algorithm and the corresponding node, check chapter 6 in [1].

If we apply the “Topic Extractor (Parallel LDA)” node to the “Romeo and Juliet” tragedy, searching for 3 topics, each represented by 10 keywords, we find one dominant topic mainly described by “love”, “death”, “ladi”, “night”, and “thou”, and two minor topics described respectively by “gentlemen”, “pretty”, “ladi”, “quarrel” and by “die”, “youth”, “villain”, “slaughter”. A topic’s importance is quantified by its keywords’ weights. If we place those topic keywords in a word cloud and scale the word size by keyword weight, we easily see that topic 0, in red, is the dominant one (Figure 2).

Figure 1. Summarizing the “Romeo and Juliet” tragedy with 10 keywords x 3 topics from a “Topic Extractor (Parallel LDA)” node.

Note: The Tika Parser node, used here to read the epub document of the “Romeo and Juliet” tragedy, is a very versatile node that can read a number of different formats for text documents: from .epub to .pdf, from .pptx to .docx, and many more.

Figure 2. The resulting 3 topics – in green, blue, and red – summarizing the “Romeo and Juliet” tragedy. Keyword weight defines keyword size in the word cloud. Note the dominant red topic of “love” and “death”.
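The link between keyword weights, topic dominance, and word size can be sketched in a few lines of Python. The weights below are hypothetical placeholders, not the values produced by the node, and ranking topics by the sum of their keyword weights is only one simple assumption about how "importance" might be computed.

```python
# Hypothetical keyword weights per topic; the real values come from the
# topic extraction step and are not reproduced here.
topics = {
    0: {"love": 90, "death": 70, "night": 60},
    1: {"gentlemen": 20, "quarrel": 15},
    2: {"youth": 12, "villain": 8},
}


def topic_importance(topics):
    """Total keyword weight per topic (one simple way to rank topics)."""
    return {topic_id: sum(weights.values()) for topic_id, weights in topics.items()}


def font_sizes(topics, min_size=10, max_size=60):
    """Map each keyword weight linearly onto a font size range."""
    all_weights = [w for weights in topics.values() for w in weights.values()]
    low, high = min(all_weights), max(all_weights)
    span = (high - low) or 1  # avoid division by zero if all weights are equal
    return {
        (topic_id, word): min_size + (w - low) * (max_size - min_size) / span
        for topic_id, weights in topics.items()
        for word, w in weights.items()
    }


importance = topic_importance(topics)
dominant = max(importance, key=importance.get)
sizes = font_sizes(topics)
print(dominant, sizes[(0, "love")], sizes[(2, "villain")])  # 0 60.0 10.0
```

With these placeholder weights, the heaviest keyword is drawn at the maximum size and the lightest at the minimum, so the dominant topic visually stands out in the cloud.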

Topic Detection

The LDA algorithm actually performs more than just text summarization; it also discovers recurring topics in a document collection. In this case, we talk about topic detection. The LDA algorithm extracts a set of keywords from each text document. Documents are then clustered together to find out the recurring keywords in groups of documents. These sets of recurring keywords can then be considered a topic covering a number of documents in the collection.
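For readers who want to see some mechanics, below is a minimal sketch of one common way to fit an LDA model, collapsed Gibbs sampling, in plain Python. It is not the implementation behind the KNIME node. The function name lda_gibbs, the hyperparameter values alpha and beta, the number of iterations, and the toy documents are all assumptions made for illustration. In this sketch every word occurrence is assigned to a topic, and the per-topic word counts and per-document topic counts that result play the roles of topic keywords and document-to-topic assignments.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic_counts, topic_word_counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab


# Toy tokenized documents, invented for illustration only.
docs = [
    "grill ribs sauce smoke grill".split(),
    "sauce ribs smoke charcoal".split(),
    "match goal team league goal".split(),
    "team league match score".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)

for k, counts in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda v: -counts[v])[:3]
    print(f"topic {k}:", [vocab[v] for v in top])

# Label each document with the topic that holds most of its tokens.
labels = [max(range(2), key=lambda t: row[t]) for row in doc_topic]
print(labels)
```

No labels are passed in at any point: the topics emerge only from which words tend to occur together in the same documents.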

Where Is Topic Detection Used?

Let’s suppose we have a series of reviews for a product, a restaurant, or a tourist location. By extracting a few topics common to groups of reviews, we can discover the features of the product, restaurant, or location that have impressed the reviewers the most.

Let’s suppose we have a disorganized set of pictures with descriptions. We could reorganize them based on the topics associated with their descriptions.

Let’s suppose we have a number of newspapers, each reporting on a set of news stories; detecting the common topics helps identify each newspaper’s orientation and the trend of the day.

You get the point. I am sure you can come up with a number of similar use cases involving topic detection.

What You Need

To implement a topic detection application via the LDA algorithm, you first need a collection of text documents, not necessarily labeled. LDA is a clustering technique: no labels are necessary. For this example, we have used a collection of 190 news articles from various newspapers, stored in the file news.table. The goal is to label each news article with the corresponding topic.

Workflow

The central piece of a workflow implementing topic detection is the LDA node. After preparing the data, we feed the text documents into the LDA node. We then set the LDA node to report n topics (for example 7) and to describe each of these topics with m words (for example 10). Again, a word cloud of the topic keywords, with size proportional to the keyword weights, has been built and can be seen in Figure 3.

Note. Here, a simple document-based representation is sufficient. Word/term extraction and text vectorization are not necessary, because the Topic Extractor (Parallel LDA) node performs all such operations internally.

Figure 3. Word cloud of 10 keywords x 7 topics in the news dataset. The news articles all seem to relate to sport and BBQ.

Figure 4. Workflow extracting 7 topics – each one described by 10 keywords – on the news dataset with the “Topic Extractor (Parallel LDA)” node.

Two topics emerge clearly from the word cloud: one (in purple) about BBQ recipes and a second (in green) about reading sport news, followed by other topics concerning sports and grilling. It is evidently a news collection about outdoor cooking and watching sport.

In the first output table of the LDA node, each news article has been assigned to one of the discovered topics. The news articles are then grouped into 7 topic groups, as was the goal of this small project.
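The assignment and grouping step can be sketched as follows. The document IDs and the per-document topic weights are hypothetical, and picking the highest-weight topic is an assumption about how a single topic label is chosen; the actual output table of the node may be structured differently.

```python
from collections import defaultdict

# Hypothetical per-document topic weights; real values would come from the
# topic extraction step.
doc_topic_weights = {
    "article_001": [0.70, 0.20, 0.10],
    "article_002": [0.05, 0.15, 0.80],
    "article_003": [0.60, 0.30, 0.10],
}


def assign_topics(weights_by_doc):
    """Label each document with the index of its highest-weight topic."""
    return {
        doc: max(range(len(weights)), key=weights.__getitem__)
        for doc, weights in weights_by_doc.items()
    }


def group_by_topic(assignments):
    """Collect document IDs under their assigned topic index."""
    groups = defaultdict(list)
    for doc, topic in assignments.items():
        groups[topic].append(doc)
    return dict(groups)


assignments = assign_topics(doc_topic_weights)
groups = group_by_topic(assignments)
print(groups)  # {0: ['article_001', 'article_003'], 2: ['article_002']}
```

A topic with no documents assigned to it, such as topic 1 here, simply does not appear among the groups.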

Deployment

There is no deployment for this use case, as there is no LDA Apply node: the LDA node works on only one set of text documents. Calculating the distance between a word/keyword-based representation of a new document and that of the detected topics would be very hard, if not impossible, mainly because of the reduced topic keyword dictionary. It is very unlikely that a new document will contain many, if any, of the few keywords used to describe the topics. That is why deployment, in this case, is the same as detecting the topics on a text collection.

In this use case, we have shown two workflows: one for single text summarization and one for a more general topic detection on a text collection. We have also explained the differences in goals and techniques between the two use cases.
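A small sketch illustrates the problem with the reduced keyword dictionary. It uses the topic keywords reported earlier for “Romeo and Juliet” and measures what fraction of each topic's keywords appears in a new piece of text. The function name keyword_overlap and the overlap measure itself are assumptions for illustration, not something the KNIME node provides.

```python
import re


def keyword_overlap(text, topic_keywords):
    """Fraction of each topic's keywords that appear in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {
        topic: len(tokens & set(words)) / len(words)
        for topic, words in topic_keywords.items()
    }


# Topic keywords as reported in the text summarization example. Note that
# they are stemmed ("ladi"), so a new document would need the same
# preprocessing to have any chance of matching them.
topic_keywords = {
    "topic_0": ["love", "death", "ladi", "night", "thou"],
    "topic_1": ["gentlemen", "pretty", "ladi", "quarrel"],
    "topic_2": ["die", "youth", "villain", "slaughter"],
}

new_document = "Two households, both alike in dignity, in fair Verona"
overlap = keyword_overlap(new_document, topic_keywords)
print(overlap)  # {'topic_0': 0.0, 'topic_1': 0.0, 'topic_2': 0.0}
```

Even a sentence taken from the play itself shares no words with any of the three keyword sets, so a distance based on these few keywords gives no usable signal for assigning the text to a topic.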


Rosaria Silipo

Rosaria Silipo, Ph.D., principal data scientist at KNIME, is the author of 50+ technical publications, including her most recent book "Practicing Data Science: A Collection of Case Studies". She holds a doctorate degree in bio-engineering and has spent most of her professional life working on data science projects for companies in a broad range of fields, including IoT, customer intelligence, the financial industry, and cybersecurity.
