Kevin Widholm
19 April 2018
8 min read

Machine Learning Pipeline for Real-Time Sentiment Analysis

Obtaining feedback from a few of your customers is easy. However, imagine that you could ask the whole world! You would immediately know the general opinion and could react accordingly. In this post, I describe a Machine Learning solution that examines Twitter data in real-time for its tonality. Combined with a query for your product or service, this yields an up-to-the-minute dashboard of your customers’ satisfaction. A deep neural network classifies whether a tweet is positive or negative, i.e. its tonality. For the implementation I used Python and Google’s deep learning framework TensorFlow.

You can start playing with the complete hosted solution and try out your keywords of interest …

…or read further to start learning Machine Learning! 🙂

Dealing with Big Data

The first requirement for a Machine Learning approach is to collect data from which the model can learn the desired behavior. Besides sheer volume, data quality is a main requirement: we need one dataset to train the deep neural net and one dataset to test the trained model. Stanford University provides an appropriate dataset called Sentiment140, in which 1.6 million tweets are already classified by their tonality. Luckily, in this case the step of creating the dataset is no longer necessary, and I can immediately show you how to deal with that big dataset.

Among the several attributes of the dataset, only the sentiment and the text are relevant for our use case, so I filter out all the others in a preprocessing step. Furthermore, for simplification I remove the neutral tweets, which reduces the task to a binary decision: positive or negative.

Cleaning and formatting is the next step in my pipeline. Here I create a lexicon, which is essentially a dictionary of all words in the dataset. This is referred to as a Bag of Words model, in which every single word is identified. Before processing, the dataset contains strings of words, which you can’t feed directly into the network. The Bag of Words model transforms the strings into vectors of equal length, because the neural network can only accept input vectors of the exact same length and uniform shape.

In fact the preprocessing step outputs a binary vector for each tweet, which has the length of the lexicon. This example shows how the vector transformation works:
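The original illustration is not reproduced here, but a minimal Python sketch shows the idea (the tiny lexicon and the function name are my own; the real lexicon is built from the training tweets):

```python
# Toy lexicon; in the real pipeline it is built from the 1.6 million tweets.
lexicon = ["good", "bad", "happy", "service", "love"]

def tweet_to_vector(tweet, lexicon):
    """Transform a tweet string into a binary Bag of Words vector."""
    words = tweet.lower().split()
    # One slot per lexicon word: 1 if the word occurs in the tweet, else 0.
    return [1 if word in words else 0 for word in lexicon]

print(tweet_to_vector("I love the good service", lexicon))  # -> [1, 0, 0, 1, 1]
```

Every tweet, no matter how long, ends up as a vector with exactly one entry per lexicon word.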

Define the deep neural net

In this step, I create the feedforward neural net using the TensorFlow library in Python. My neural network classifies tweet vectors as negative or positive using the following structure.

  • The input layer has as many neurons as the tweet vector has entries, so every neuron is related to exactly one word in the lexicon. The weighted sum of each neuron is fed through the ReLU activation function.
  • The output layer has two neurons and a softmax activation function, because the result can be either negative or positive.
  • The number of hidden layers and their nodes is customizable, because there is no specific relation to the output or the input. For a classification problem like sentiment analysis, two hidden layers are enough to achieve an acceptable training result.

The following snippet defines this structure in TensorFlow:
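The original TensorFlow snippet is not reproduced here; as an illustration, the following NumPy sketch computes the forward pass of the same feedforward structure (the layer sizes and the weight initialization are assumptions of mine):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))  # numerically stable
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_input, n_hidden1, n_hidden2, n_classes = 5, 500, 500, 2  # sizes are assumptions

# Randomly initialized weights and biases for two hidden layers plus output.
params = {
    "W1": rng.normal(0, 0.1, (n_input, n_hidden1)),   "b1": np.zeros(n_hidden1),
    "W2": rng.normal(0, 0.1, (n_hidden1, n_hidden2)), "b2": np.zeros(n_hidden2),
    "W3": rng.normal(0, 0.1, (n_hidden2, n_classes)), "b3": np.zeros(n_classes),
}

def forward(x, p):
    h1 = relu(x @ p["W1"] + p["b1"])        # hidden layer 1, ReLU
    h2 = relu(h1 @ p["W2"] + p["b2"])       # hidden layer 2, ReLU
    return softmax(h2 @ p["W3"] + p["b3"])  # output layer, softmax over 2 classes

probs = forward(np.array([[1., 0., 0., 1., 1.]]), params)
print(probs)  # two class probabilities summing to 1
```

In TensorFlow the same structure is expressed with weight variables and matrix multiplications per layer; the shapes and activations are what matter here.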

Train the Machine Learning model

Between defining the neural net and training the Machine Learning model there are usually some steps in between, like defining a cost function and an optimizer to minimize that cost. I use cross entropy as the cost function (also called the loss). It represents how ‘badly’ the model has classified the tweets, so this value should reach a minimum. The neural network is learning when the loss goes down and the accuracy goes up during training.
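For intuition: the cross entropy of a single example is simply the negative log of the probability the model assigned to the true class. A hand-rolled sketch (not the TensorFlow op):

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Loss for one example: -log of the probability assigned to the true class."""
    return -math.log(predicted_probs[true_class])

# A confident, correct prediction yields a small loss ...
print(cross_entropy([0.1, 0.9], true_class=1))  # ~0.105
# ... while a confident, wrong prediction yields a large one.
print(cross_entropy([0.9, 0.1], true_class=1))  # ~2.303
```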

TensorFlow offers different optimization algorithms, each with its own characteristics. After some experiments training the network with the AdamOptimizer and with the GradientDescentOptimizer (including a decaying learning rate), my choice fell on the AdamOptimizer, because its accuracy increased more steeply during training.

Now it’s time to feed the tweet vectors into a training loop. The network classifies them and returns a classification error. In each iteration I feed 100,000 tweet vectors into the training loop and reach the best possible accuracy after 50 epochs.

Let’s take a look into the training loop:
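The original snippet is not shown here; the following simplified NumPy loop illustrates mini-batch training with plain gradient descent on a toy dataset (a single linear layer for brevity, instead of the full network and AdamOptimizer described above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in data: 200 binary Bag of Words vectors of length 5, with a
# synthetic rule playing the role of the sentiment label.
X = rng.integers(0, 2, (200, 5)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)

W = np.zeros((5, 2))   # single linear layer instead of the full deep net
b = np.zeros(2)
learning_rate, epochs, batch_size = 0.5, 50, 20

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(epochs):
    for i in range(0, len(X), batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]
        probs = softmax(xb @ W + b)              # forward pass
        probs[np.arange(len(yb)), yb] -= 1.0     # gradient of cross entropy
        grad = probs / len(yb)
        W -= learning_rate * xb.T @ grad         # gradient descent update
        b -= learning_rate * grad.sum(axis=0)

train_accuracy = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"training accuracy: {train_accuracy:.2f}")
```

The real loop has the same shape: slice the next batch of tweet vectors, run the forward pass, compute the loss, and let the optimizer update the weights, epoch after epoch.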

Evaluate the model

To test the quality of the sentiment classification under real-world conditions, I have to use tweets that the system has NOT seen during training. Otherwise, it could learn all the training tweets by heart and still fail at classifying a real tweet. The test dataset contains 10,000 new tweets. In the evaluation step I test the network and get an accuracy of nearly 80 %, which means roughly 8,000 tweets were classified correctly.
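Computing the accuracy itself is straightforward; a toy illustration (the numbers below are made up to mirror the 80 % figure):

```python
def accuracy(predictions, labels):
    """Fraction of held-out tweets classified correctly."""
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# 8 of 10 held-out tweets match their true label -> 0.8
print(accuracy([1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
               [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]))  # -> 0.8
```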

To optimize the accuracy there are several options, such as regularizing the cost function or the popular dropout method. Dropping units out of the neural network reduces overfitting and improves the accuracy. So I implement dropout by dropping out neurons with a probability of 50 %, as can be seen in the neural net model below.

The implementation of dropout is nothing fancy. There are only some additional lines of code in the snippet:
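The original snippet is not reproduced here; the following NumPy sketch shows the core idea of inverted dropout, which TensorFlow provides via tf.nn.dropout:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(activations, keep_prob=0.5, training=True):
    """Randomly zero out units with probability 1 - keep_prob during training.

    Surviving activations are scaled by 1/keep_prob ("inverted dropout"),
    so no rescaling is needed at prediction time.
    """
    if not training:
        return activations
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones((1, 8))                # pretend hidden-layer activations
print(dropout(h))                  # roughly half the units are zeroed
print(dropout(h, training=False))  # unchanged at prediction time
```

Note that dropout is only active during training; at prediction time all neurons participate.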

Integrate the model for real-life predictions

One can think of the training process of a Machine Learning model as a transformation of the training data into a model state that shows the desired behavior encoded in the data. In the case of a neural network, this state simply consists of many weights and parameters, which can be stored as an artifact. To use the trained model, the weights and parameters are restored into a network of the same structure.
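TensorFlow handles this with tf.train.Saver; the framework-agnostic NumPy sketch below only illustrates the principle of storing and restoring such a state (the weight names are made up):

```python
import io
import numpy as np

# Pretend these are the trained weights and biases of the network.
trained = {"W1": np.full((3, 2), 0.5), "b1": np.zeros(2)}

# Store the model state as an artifact (here an in-memory .npz archive) ...
artifact = io.BytesIO()
np.savez(artifact, **trained)

# ... and later restore it into a network of the same structure.
artifact.seek(0)
restored = dict(np.load(artifact))
print(all(np.array_equal(trained[k], restored[k]) for k in trained))  # -> True
```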

In this application I aim for a real-time classification that can be used interactively in the front-end. Another option would be batch classification, if you want to analyze stored observations all at once. For the real-time sentiment dashboard I want to connect the model synchronously with a streaming API or a stream processing framework.

From our setup, there were two obvious options to connect to a Twitter stream.

Using stream processing framework Apache Flink:

  • Java-based Flink streaming system to initialize a Twitter stream
  • Data source: Twitter Application
  • Publish-subscribe messaging system Apache Kafka to transfer the Twitter data in real-time from the Flink system to a Python application, where the Machine Learning model classifies the tweets

Using the Twitter REST API:

  • Twitter REST API to connect to a Twitter stream within a Python application
  • Data source: Twitter Application

I decided on the Twitter REST API approach, because it requires no data transfer between two programming languages and is simpler. For both implementations you need an application token from Twitter in order to request the tweet stream. Like most social media content, tweets are full of special characters and creative spellings, and the model performs better on cleaner input. So I convert the incoming Twitter stream into a clean bag of words. After that, the classified real-time tweets can be written into a database to collect them for the later visualization.
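As an illustration of the cleaning step, here is a minimal sketch; the exact rules used in the pipeline are not shown above, so the regular expressions below are my own assumptions:

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, and special characters; lowercase the rest."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)           # @mentions
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # special characters, emojis
    return text.lower().split()

print(clean_tweet("@anna LOVE this!!! 😍 https://t.co/xyz"))  # -> ['love', 'this']
```

The resulting word list can then be fed into the same Bag of Words transformation used during training.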

Analyze the tweet sentiments

Now you know what’s behind the scenes of our Sentiment Analyzer with Machine Learning. Initially, the Machine Learning model is trained and stored. A persistent connection to Twitter’s servers streams the tweets into my Python application, where I clean and classify them without delay into positive and negative polarity.

The resulting classified tweets are saved into a database, so the front-end application can analyze the latest tweets which match the keyword typed in by the user.

As seen on the web app, a real-time sentiment pie chart and line chart visualize the tweet sentiments. The pie chart shows the distribution of positive and negative tweets for the selected keyword.

The line chart provides a timeline of the average sentiment of the last 10 tweets. If the line is above zero, the majority of tweets is positive; if it is below zero, the majority has a negative polarity.
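This timeline value can be sketched as a moving average over the last 10 classified tweets, mapping positive to +1 and negative to -1 (my own minimal illustration):

```python
def rolling_sentiment(labels, window=10):
    """Average of the last `window` tweets, with positive=+1 and negative=-1."""
    scores = [1 if label == "positive" else -1 for label in labels]
    recent = scores[-window:]
    return sum(recent) / len(recent)

# 7 positive and 3 negative among the last 10 tweets -> (7 - 3) / 10 = 0.4
labels = ["positive"] * 7 + ["negative"] * 3
print(rolling_sentiment(labels))  # -> 0.4
```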

Finally, you can get an overview of the whole subject with the poster below:

Twitter Sentiment Analyzer with Machine Learning

If there are any questions, I would like to hear from you.

But first I think it’s time to analyze some tweets by clicking on this link:
