At any important political or social event, the aftermath of the keynote speech is often entrusted to a Word Cloud. The large size of the most frequently repeated words supposedly conveys the central topics of the speech. Is it really so? Can we really entrust a word cloud to convey, even if only partially, the essence of the speech?
How informative is a word cloud really?
Would I be able to understand my colleague’s speech if it just came in the form of a word cloud?
Would I even be able to identify my preferred song if I could see its word cloud? It is hard to say, without trying.
Let’s Play a Game!
I will show you a number of word clouds built on texts from well-known pop songs from the 80s until now. You are supposed to guess the song and possibly its title. Let’s see how many you can recognize.
Notice that only punctuation was removed from the text, but no other cleaning or stemming procedure was applied. The word clouds thus report all words in their original form as in the song text.
Figure 1. “Guess the Song” game. Can you recognize any of the songs used to generate these word clouds?
Any ideas yet? How many have you identified so far for sure? How many have you got an idea about, but you are not yet completely sure?
The game actually runs in a web browser. Ten songs are randomly extracted from a list.
For each extracted song, one at a time, the text is read, processed, and parsed into words, and the corresponding word cloud is generated. The webpage then displays the word cloud and a text box where you are supposed to write the song title if you know it (Figure 1). You have up to three tries to guess the song, after which the song counts as a failure.
Whether you succeed or fail, once a song is done, the game moves on to the next one, and so on until all 10 songs have been shown.
The last page of the application produces a rough summary of your attempts, with percentages of success vs. failures.
Figure 2. The final page for the “Guess the Song” game. Here you can see the percentages of your failures and successes.
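The three-tries rule described above can be sketched in a few lines of Python. This is only an illustration of the game logic, not the KNIME workflow itself; the function names and the matching rule (case-insensitive, ignoring extra spaces) are assumptions made for the example.

```python
MAX_TRIES = 3  # from the game description: up to three tries per song


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace (an assumed, lenient matching rule)."""
    return " ".join(title.lower().split())


def play_song(correct_title: str, guesses) -> bool:
    """Return True if one of the first MAX_TRIES guesses matches the title."""
    for guess in list(guesses)[:MAX_TRIES]:
        if normalize(guess) == normalize(correct_title):
            return True  # success: the game moves on to the next song
    return False  # all tries used: the song counts as a failure
```

A fourth guess is ignored even if it is correct, which mirrors the "up to three tries" rule.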
The Application Behind the Game
While you take your time to think, I will show you the application running behind this little game.
KNIME Analytics Platform
This application — for the creation of word clouds from song texts and the collection of answers from users — was developed with the KNIME Analytics Platform.
KNIME Analytics Platform is open-source software for Data Science, covering all your data needs from data ingestion and data blending to data visualization, from Machine Learning algorithms to data wrangling, from reporting to deployment, and more. KNIME Analytics Platform is based on a graphical user interface for visual programming. This makes it very intuitive and easy to use, considerably reducing the learning time. KNIME Analytics Platform has also been designed to be open to different data formats, data types, data sources, data platforms, as well as external tools.
Because of its graphical interface, computing units are represented as small colorful blocks, named “nodes.” Assembling nodes in a pipeline, one after the other, implements a data processing application. The pipeline is called “workflow” (Figure 3).
KNIME Analytics Platform is the core software. However, it can be integrated with a number of packages to work on unstructured data. The KNIME Text Processing extension takes care of all text-based operations, from simple cleaning to keyword extraction, from latent Dirichlet allocation (LDA) to the extraction of words and their frequencies.
A workflow can also be executed from a web browser through the KNIME Server. Dedicated nodes build input and output web components that can be assembled to compose a webpage. Inserting these special nodes within a workflow provides stopping points in the execution flow, where an input is required or an output is displayed.
Given all those characteristics (openness, flexibility, open source, text processing extension, visual programming, and optional web-based execution), we chose KNIME Analytics Platform to develop our “Guess the Song” game application.
Figure 3. The “Guess the Song” workflow. It starts by reading the list of songs available. The first loop reads the text and extracts the words for each song, one by one. The internal loop builds the word cloud and the webpage for the guess, allowing max. three attempts to the player. Finally, the last metanode builds the landing webpage with the success rate summary.
Reading the Data and Creating the Webpage Template
The workflow starts with a few gray metanodes. Metanodes are gray nodes that contain other, more basic nodes.
The first metanode in the workflow in Figure 3, named “Prepare Data,” reads the list of available songs and extracts 10 at random to present to the player (Figure 4).
Figure 4. Content of the metanode named “Prepare Data.” You can see a node to read the Excel file with the list of songs, a node to perform the random shuffling, then a node to extract the top 10 songs after shuffling, and finally a column filter node to keep only the necessary input attributes.
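For readers who think in code, the four steps inside “Prepare Data” (read the list, shuffle, keep the top 10, filter the columns) correspond roughly to the Python sketch below. The column names `title` and `lyrics_path` are hypothetical, since the actual columns of the Excel file are not shown here, and the list is passed in as plain dictionaries rather than read from a file.

```python
import random


def pick_songs(song_list, n=10, seed=None):
    """Shuffle the song list, keep the first n rows, and keep only the needed columns."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    shuffled = list(song_list)         # copy, so the original list is untouched
    rng.shuffle(shuffled)              # random shuffling
    top = shuffled[:n]                 # top n rows after shuffling
    # Column filter: "title" and "lyrics_path" are assumed column names.
    return [{"title": s["title"], "lyrics_path": s["lyrics_path"]} for s in top]
```

Shuffling and then taking the first rows is equivalent to sampling without replacement, so no song can appear twice in one game.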
Outer Loop to Process All Songs One by One
The light blue nodes delimit a loop. One node starts the loop (the Start node) and one ends the loop (the End node).
Here we have an external loop to process the song texts one by one and an internal loop to allow the player up to a maximum of three guesses.
The outer loop starts from the first song in the random list, reads its text, strips the punctuation, and extracts all words together with their relative frequencies. This all happens inside the metanode named “Text Processing.”
The nodes inside this metanode belong to the KNIME Text Processing extension. You need to install it in KNIME Analytics Platform to continue. Check this YouTube video on how to install KNIME extensions.
Figure 5. Content of the metanode named “Text Processing.” The path of the file containing the song lyrics is passed to a File Reader node, where the song text is read. Then the text is converted to a special Document type, stripped of punctuation, and converted to lowercase. Finally, the Bag of Words node parses the text and extracts all words, and the TF node calculates their relative frequencies.
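The same processing chain can be approximated outside KNIME. The sketch below is a plain Python stand-in for the punctuation removal, lowercasing, word extraction, and relative frequency steps, not the implementation used by the Text Processing nodes. It assumes that relative frequency means the word count divided by the total number of words in the song, and it only strips ASCII punctuation.

```python
import string
from collections import Counter


def term_frequencies(text: str) -> dict:
    """Return each word's relative frequency: its count divided by the total word count."""
    # Remove ASCII punctuation (typographic quotes and dashes are left untouched).
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.lower().split()    # lowercase, then split on whitespace
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)            # bag of words with absolute counts
    return {word: count / total for word, count in counts.items()}
```

In a word cloud, these frequencies drive the font size, so a chorus repeated many times quickly dominates the picture.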
Inner Loop to Allow a Maximum of Three Guesses for Each Song
Creating the Landing Page with Summary Performance
The metanode named “Guess Summary” concludes the game, both by concluding this workflow and by creating the last landing page.
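The percentages shown on the landing page boil down to a simple count. As a minimal sketch, assuming the outcome of each song is stored as a boolean (an assumption about the data layout, not a description of the metanode's content):

```python
def summarize(results):
    """Return success and failure percentages for a list of per-song outcomes."""
    total = len(results)
    if total == 0:
        return {"success_pct": 0.0, "failure_pct": 0.0}  # nothing played yet
    successes = sum(1 for outcome in results if outcome)
    success_pct = 100.0 * successes / total
    return {"success_pct": success_pct, "failure_pct": 100.0 - success_pct}
```

For example, 7 songs guessed out of 10 gives 70% successes and 30% failures.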
Did You Guess?
I think you have had enough time by now. Have you guessed? Well, if you haven’t, or if you have and want to know if you’re right, below are the answers.