Application of Machine Learning in Understanding Sentiment of a Conversation
Chatbots are becoming the primary conversational interface for customer engagement and customer service. AI-powered chatbots can now converse with customers intelligently, extracting intents and entities precisely and responding to queries using advanced technologies such as NLU (Natural Language Understanding) and NLG (Natural Language Generation). Domain-specific chatbots learn quickly through auto-learning algorithms, and such chatbots are going to contribute significantly to optimizing the overall cost of service. At the same time, it is equally important to measure the quality of service a chatbot provides, to ensure customers are being served effectively. Measuring aspects such as the effectiveness of a conversation, the customer’s satisfaction level, or other hidden signals in the conversation (like sensing a lead) is going to be the next big addition to marketing and customer engagement through chatbots and other conversational platforms.
In this article we explore how sentiment analysis, combined with other technologies, can be used to tag and understand the sentiment of a conversation, and so measure this aspect of service quality for chatbots.
Sentiment analysis, or opinion mining, is a field within Natural Language Processing (NLP) that identifies and extracts opinions from phrases or conversations. We have used sentiment analysis together with machine learning and other technologies to build a platform for understanding conversations effectively.
The core solution was built with technology stacks and libraries including Elasticsearch, Keras (with a Theano backend) and word2vec, each discussed later in the article.
We discuss the specifics of a few of these technologies in greater detail in later sections; first, let’s look at the high-level approach adopted to implement the solution. We first train a neural network for sentiment analysis, and then integrate the trained model to predict the sentiment of a conversation in real time. Before training, the input text goes through pre-processing steps. The same pre-processing pipeline is used for both training and prediction.
Let’s first understand what data pre-processing is and the steps involved in it. Data pre-processing means cleaning noise from the raw data and converting it into a format an algorithm can use. The steps involved are:
- Punctuation removal
- Stopwords removal
- Converting to n-grams
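The three steps above can be sketched in a few lines of Python. This is only an illustration and not the production pipeline: the function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for the example, and a real system would use a much fuller stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list used only for this example.
STOPWORDS = {"the", "that", "i", "was", "a", "for", "this", "you", "so"}

def preprocess(text, n=1):
    """Return the list of n-gram tokens for one chat message."""
    # 1. Punctuation removal (ASCII punctuation only)
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Stop-word removal, compared case-insensitively
    words = [w for w in cleaned.split() if w.lower() not in STOPWORDS]
    # 3. Conversion to n-grams; n=1 leaves plain single-word tokens
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(preprocess("The watch that I bought was very bad"))
print(preprocess("Thank you so much for quick resolution.", n=2))
```

Because the same function would be called at training time and at prediction time, both stages see text in exactly the same form.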
Once pre-processing is complete, the data is fed to the neural network for training. The training data was a mix of the organization’s internally available tagged dataset and a publicly available tagged dataset (in this case the IMDB movie review dataset, containing 25,000 positive and 25,000 negative reviews). The dataset goes through the pre-processing stage, where each of the steps above performs its task, and the output is a list of n-gram tokens (for our purpose we took n as 1).
Now we have a matrix of size m x n, where m is the number of records (rows) in our dataset and n is the number of keywords obtained after the pre-processing steps. These keywords are passed to the word2vec module, which outputs a vector of length 200 for each record (the average of the vectors of all its keywords); this is our training data (remember, the training matrix is now m x 200). Each 200-length numerical vector is fed to the input layer of our neural network, which has a hidden layer and an output layer, with sigmoid as the activation function. We ran 100 epochs with a bit of tweaking to obtain a model with 89 percent accuracy (benchmarked on our organization’s internal data).
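To make the averaging step concrete, here is a minimal sketch. The embedding values are made up and have only 4 dimensions instead of 200, the names `TOY_EMBEDDINGS` and `document_vector` are hypothetical, and falling back to a zero vector when no keyword is known is an assumption, not something the solution described here is known to do.

```python
# Toy 4-dimensional embeddings standing in for a trained word2vec model.
# The article's vectors have 200 dimensions; these values are invented.
TOY_EMBEDDINGS = {
    "watch":  [0.2, 0.0, 0.4, 0.6],
    "bought": [0.0, 0.2, 0.0, 0.2],
    "bad":    [-0.8, 0.4, 0.2, -0.2],
}

def document_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumption: zero vector when nothing is known
    # zip(*known) walks the vectors dimension by dimension
    return [sum(values) / len(known) for values in zip(*known)]

vec = document_vector(["watch", "bought", "very", "bad"], TOY_EMBEDDINGS, dim=4)
print(vec)
```

Whatever the number of keywords in a message, the result always has the same length, which is what lets every record become one fixed-size row of the training matrix.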
The model was tested on the organization’s internal data, where its accuracy was 89%. It is deployed to the production environment, where it predicts the sentiment of each query and saves the record in our datastore, Elasticsearch, which is used for advanced searching and metrics generation.
This model can identify only two sentiments, positive and negative, which at times is not sufficient to understand the mood of the user. We aim to add a neutral sentiment and to use a similar model for emotion detection (angry, sad, happy, etc.). We will also try to improve accuracy by using more advanced technologies and more chat-based data.
Let’s look at the input of an individual chat message to our module, which is powered by the model developed through the steps above, and the resulting output.
Input : “The watch that I bought was very bad”
Output : The output mentioned below is stored in our Elasticsearch datastore.
“raw_conversation”: “The watch that I bought was very bad”,
The above output is that of an individual chat message within a longer conversation. The objective is to derive the sentiment of the overall conversation. To achieve this, the outputs for individual chat messages are grouped using pre-defined rules for the entire conversation, and a final outcome is presented.
|CONVERSATION|USER CONVERSATION ANALYSIS|
|---|---|
|AGENT: Hello, how may I assist you? USER: The mobile phone that I had bought from your store Is not working fine|“raw_conversation”: “The mobile phone that bought from your store Is not working fine”|
|AGENT: As I can see in my system, you ordered a phone two days ago, which was delivered yesterday. I assure you of the best service. USER: Yeah, I need a refund for this defective product|“tags”: [“refund”, “need”, “product”, “defective”], “raw_conversation”: “I need a refund for this defective product”|
|AGENT: Don’t worry, we have escalated the matter. A pickup agent will pick up the phone and the refund for the same will be processed. USER: Thank you so much for quick resolution. AGENT: We are here to help you. Have a great day ahead.|“tags”: [“quick”, “much”, “resolution”, “Thank”], “raw_conversation”: “Thank you so much for quick resolution.”|
Each user interaction in the above conversation is fed to our model and saved separately in the Elasticsearch datastore. The results of the individual interactions are then grouped using appropriate business logic to arrive at a sentiment for the overall conversation. An example of such business logic: a conversation is tagged positive or negative according to whether positive or negative interactions are higher in number (or percentage). Likewise, one can write business logic that uses tag words and sentiments together, depending on the requirement.
Let’s explore how a few of these technologies are used to achieve the above result.
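The example business logic described above, a simple majority vote, might look like the sketch below. The function name `conversation_sentiment` and the "mixed" label for ties are assumptions made for illustration; the rule as stated does not say how a tie should be handled, and the per-message labels in the usage line are hypothetical.

```python
from collections import Counter

def conversation_sentiment(interaction_sentiments):
    """Majority vote over per-message labels ('positive' / 'negative')."""
    counts = Counter(interaction_sentiments)
    positive, negative = counts["positive"], counts["negative"]
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "mixed"  # assumption: ties are not covered by the stated rule

# Hypothetical labels for the three user messages of one conversation
print(conversation_sentiment(["negative", "negative", "positive"]))
```

A rule like this treats every message equally. A variant that weights the final messages more heavily, or that also looks at tag words such as "refund", would be one way to combine tags and sentiments as suggested above.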
Elasticsearch mainly plays two roles in this solution:
1. Enhanced searching with fuzziness – The main role of Elasticsearch is to enhance the search mechanism and generate statistics for analytics in near real time. It also helps to identify entities, which can be generic entities (date, time, place) or user-defined entities (for example, capturing shorthand words and converting them to their full form). Fuzziness helps map misspelled keywords to the correct ones (e.g. delli -> delhi).
2. As a datastore – It acts as a datastore, storing the sentiment as well as the pre-processed data fed to the module. Saving the pre-processed data enables tag-based searching and faster execution of aggregation queries.
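As an illustration of the fuzziness point, the sketch below builds the body of an Elasticsearch `match` query with the `fuzziness` option set to `"AUTO"`. Only the request body is constructed, as a plain Python dictionary; the field name `city` and the helper name `fuzzy_match_query` are hypothetical, and the index, mapping and queries actually used in the solution are not known.

```python
import json

def fuzzy_match_query(field, text):
    """Build an Elasticsearch query body for a typo-tolerant match."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    # "AUTO" lets Elasticsearch pick the allowed edit
                    # distance from the length of the term
                    "fuzziness": "AUTO",
                }
            }
        }
    }

# "city" is a hypothetical field name used only for this example
body = fuzzy_match_query("city", "delli")
print(json.dumps(body, indent=2))
```

A body like this would be sent to the search endpoint of an index. With one edit allowed, a misspelling such as "delli" can still match documents containing "delhi".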
Neural Networks : A neural network is used as the training algorithm for classifying a query as positive or negative (a third class, neutral, will be implemented in the next phase). So what is the benefit of using a neural network rather than a rule-based approach?
Firstly, a simple rule-based approach has a lot of drawbacks :
1. It doesn’t take sentence semantics into consideration, so it makes decisions based only on keywords whose part of speech is adjective.
2. It needs a lot of conditions to handle negation, adverbs, and queries containing keywords such as but / else / otherwise.
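The negation problem is easy to reproduce with a toy keyword rule. The word lists and the function `keyword_sentiment` below are invented for this illustration and are not part of the solution described here.

```python
# Hypothetical keyword lists for a naive rule-based classifier
POSITIVE_WORDS = {"good", "great", "quick"}
NEGATIVE_WORDS = {"bad", "defective", "poor"}

def keyword_sentiment(text):
    """Count positive and negative keywords and compare the totals."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) \
        - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"

print(keyword_sentiment("the watch was very bad"))    # handled correctly
print(keyword_sentiment("the service was not good"))  # negation is missed
```

The second sentence is tagged positive because the rule sees "good" and ignores "not". Each such case needs its own extra condition, which is how the number of rules grows.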
Now to why we chose a neural network specifically rather than some other learning algorithm. Neural networks are universal approximators, and when given the right configuration and semantically preserved numeric vectors of the training data (discussed in the next point), they can outperform other learning algorithms such as SVM, naive Bayes, decision trees or random forests. The corpus used was a mix of internally available and other tagged datasets. Excluding sarcasm, the model gave us an accuracy of 89 percent for two-class classification. Keras was the library used for building and training the neural network, with Theano as the backend.
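To show the shape of such a network without depending on any library, here is a plain-Python stand-in: one hidden layer, sigmoid activations and a single sigmoid output, trained by stochastic gradient descent on binary cross-entropy. It is not the Keras model described above. The class `TinyNet`, the hidden size, the learning rate, the loss choice and the 2-dimensional toy inputs are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class TinyNet:
    """One hidden layer, sigmoid activations, single sigmoid output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, output

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, y, lr):
        hidden, output = self.forward(x)
        # Sigmoid output + binary cross-entropy: dLoss/dz_out = output - y
        delta_out = output - y
        for j, h in enumerate(hidden):
            # Back-propagate through hidden unit j before updating w2[j]
            delta_hidden = delta_out * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * delta_out * h
            self.b1[j] -= lr * delta_hidden
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_hidden * xi
        self.b2 -= lr * delta_out

def mean_loss(net, samples):
    """Average binary cross-entropy over (vector, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in samples:
        p = net.predict(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(samples)

# Toy 2-dimensional "document vectors"; the article feeds 200-dimensional ones.
SAMPLES = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]

net = TinyNet(n_inputs=2, n_hidden=3)
loss_before = mean_loss(net, SAMPLES)
for _ in range(100):  # the article reports running 100 epochs
    for x, y in SAMPLES:
        net.train_step(x, y, lr=0.5)
loss_after = mean_loss(net, SAMPLES)
print(loss_before, loss_after)
```

An output above 0.5 would be read as positive and below 0.5 as negative. In a library such as Keras the same structure is expressed as a stack of dense layers, and the training loop is handled by the library.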
For representing text as numbers, there are many options. The simplest approach is to create a word frequency matrix that simply counts the occurrences of each word. A variant of this method estimates the log-scaled frequency of each word while taking into account its occurrence across all documents (tf-idf). Another popular option is to take into account the context around each word (n-grams), so that, for example, New York is evaluated as a bi-gram rather than as two separate words. However, these methods capture only frequencies, not the high-level semantics of the text. A more recent advance in the field of Natural Language Processing is the use of word embeddings. Word embeddings are dense representations of text, learned through a feed-forward neural network, so that each word is represented by a point embedded in a high-dimensional space. With careful training, words that can be used interchangeably should have similar embeddings. A popular word embedding network is word2vec. Word2vec is a simple, one-hidden-layer neural network that sums word embeddings and, instead of minimizing a multi-class logistic loss (softmax), minimizes a binary logistic loss on positive and negative samples, which allows it to handle huge vocabularies efficiently.
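For comparison with embeddings, the frequency-based representations mentioned above can be sketched as follows. The weighting used here, raw term count multiplied by log(N / document frequency), is one common tf-idf variant chosen for illustration; the function names and the two sample documents are hypothetical.

```python
import math
from collections import Counter

def term_counts(documents):
    """Word frequency 'matrix': one Counter of word counts per document."""
    return [Counter(doc.lower().split()) for doc in documents]

def tf_idf(documents):
    """Weight each count by log(N / number of documents containing the term)."""
    counts = term_counts(documents)
    n_docs = len(documents)
    # Iterating a Counter yields each distinct term once per document
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]

weights = tf_idf(["good phone", "bad phone"])
print(weights)
```

A word that appears in every document, such as "phone" here, gets a weight of zero, while words that distinguish one document from the others keep a positive weight. Neither representation tells us that "good" and "great" are related, which is the gap word embeddings fill.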
To represent the 20 Newsgroups documents, we used a pre-trained word2vec model provided by Google. This model was trained on 100 billion words of Google News and contains 300-dimensional vectors for 3 million words and phrases. As a pre-processing step, the 20 Newsgroups dataset was tokenized and English stop-words were removed. Empty documents were removed (555 documents deleted), as were documents with no word in the word2vec model (9 documents deleted). The final dataset consists of 18282 documents. For each document, the mean of the embeddings of its words was calculated, so that each document is represented by a 300-dimensional vector.
We hope this article is useful for understanding the application of sentiment analysis in tagging conversations. It is only a guideline; a production-deployable solution would need to take care of many other aspects, as per business needs.
- Abheet Gupta
- Rahul Ahuja
- Debmalya De