Thursday, 4 October 2018

NLP and LSTM for Classifying Customer Complaints about Financial Products


Some of this work is available as python code at https://github.com/faraway1nspace/NLP_topic_embeddings

Topic Embeddings

One of the most interesting concepts to arise out of machine learning in the last decade has been the concept of "Embeddings" (see, for example, "word2vec" developed by Tomas Mikolov at Google). The general idea is to transform categories from discrete, independent objects and "embed" them into a low-dimensional vector space. Each object/category is no longer it's own discrete entity, but is a vector in some space. In the word2vec example, each English word is a vector of size 300, such that variations along each of the 300 dimensions has inferred semantic importance (e.g,. "girl" vs. "boy" vary along a dimension, just as as "queen" vs. "king"). Another example is the famous instacart model, where the analysts sought to embed 10 million grocery-items into 10 dimensions. For my Insight Data project, I sought an embedding for customer-feedback topics from a propriety customer survey dataset (e.g., to organize the various things customers complain about).

From the perspective of a data-analyst, the immediate utility of the embedding approach is as an alternative means of vectorizing categorical variables, and, in particular, finding a vectorization that is low-dimensional. Consider the traditional route of vectorizing categorical data via "One-hot-coding": each category becomes a binary dummy variable in a matrix of size K-1. In contrast, the embedding approach uses deep-learning to vectorize each category into continuous latent vectors, whereby the embedding dimension D is typically much lower than the number of categories D less than K. In my case, I was dealing with about 200 consumer-feedback categories (and growing exponentially!), while my embedding space just had 6 dimensions (see Fig 1) below.
Fig1: Example embedding: various consumer-complaint categories vectorized into 6 dimensions.

Why all the fuss? In each discipline, there is a flurry creative uses for the embedding technology. For my use-case, I was interested in embeddings for two applications:
  • finding redundant categories; and
  • grouping categories into meaningful super-groups (clustering).
For the first goal, it is possible to find redundant/related categories according to the following intuition: vectors that are close in the embedding space are strongly related, or even redundant. We can cluster these putative different categories into super-categories through clustering algorithms, like K-means.

I will walk through a particular problem which invites embedding solution, and another which perhaps does not benefit from embeddings.

Example 1: Multi-class Classification

The following example comes from a propriety dataset, so the context and results are all general, and the code cannot be shared. Nonethless, the results are interesting for two reasons: the use of embeddings, and the pivotal role of the multi-class loss.

Goal

While studying ML & data science with Insight Data 2018, I played consultant for a local startup company in Toronto. They had a large data-set of customer feedback in text form. Each row of text was a customer complaint or recommendation regarding a large variety of products, companies and shopping experiences. They employed an army of humans to go through each text feedback and extract customer sentiments (-,0,+). The problem was that these categories/labels were increasing exponentially in time: the number of labels was growing from 50 to 200 (to perhaps 1000?) in a matter of months. How could I scale up this classification & sentiment analysis?

Fig 2: Left: Exponential increase in the number of labels assigned to the customer-feedback text. Right: extreme skew to the distribution of labels in the data: most are rare (perhaps redundant?) a few are well represented.


The scaling issue became very apparent after trying a generic 'go-to' model with a simple natural language processing (NLP) component and deep-learning model. Using python, tensorsflow, and the keras API, the 'go-to' model had the following pipeline: pre-process the text (stemming words, remove stopwords, etc.); vectorize the words of the text with a word-embedding (like word2vec, but trained within the context of the this problem); run the word-vectors through a recurrent neural network (e.g., LSTM); finally, use Multi-class Classification (and conditional sentiment analysis) to assign category-labels to each customer-review.

Problem

The go-to model performed well for the most abundant 20 categories, but it didn't scale well for entire dataset of >200 categories. My worry was that the exponential increase in the number of categories meant that there was considerable redundancy in the ever-increasing quantity of (putative) categories.

Solution: Embedding

Inspired by Instacart and the lessons at fast.ai, I thought it might be helpful to try and reduce the number of "effective" categories through embeddings. Basically, the algorithm learns an embedding space, different regions of the space have "meaning", and each category is merely a that point exists in this space. Related Categories like "free samples" and "sign-up incentives" are close together in this space, while categories like "website layout" and "shipping costs" occupy a different part of the space.

The neural architecture is shown in figure 3. And the keras API looks like:

# INPUTS and OUPUTS
# X_train: the vectorized text data (training set)
# Y_train: is a 3D tensor of targets of conditional targets: 
# ... : the rows represent observation
# ... : the cols represent categories
# ... : the slices represent the sentiments per category [NA,-,0,+]
# X_topics_train : just a vector of integers 1:N_labels
# W_train : sample-weights for unbalanced design

embed_dim_lstm = 194 # word-embedding dimensions
lstm_OutputDim = 100 # output dimensions of the LSTM
embed_dim_topic = 6 # category embedding dimensions
batch_size = 128 # mini-batch size
hidden_nodes_final = (np.linspace(lstm_OutputDim,Ymultclass.shape[1],3).round().astype(int))[1] # dimensions of the final hidden layer

lstm_input_layer = keras.layers.Input(shape=(X_train.shape[1],), dtype='int32', name='lstm_input',) # lstm input
lstm_embed_layer = keras.layers.Embedding(input_dim=max_features, output_dim=embed_dim_lstm, input_length=X_train.shape[1])(lstm_input_layer) # input_dim = vocab size,
lstm_output = keras.layers.LSTM(lstm_OutputDim)(lstm_embed_layer) # the output has dimension (None, 12)
# need to Repeat the LSTM output from a 2D matrix to a 3D tensor in order to concatenate to the category-embedding
lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)
topic_input_layer = keras.layers.Input(shape=(X_topics_train.shape[1],), dtype='int32', name='topic_input') # input the topics
topic_embed_layer = keras.layers.Embedding(input_dim=nTopics, output_dim = embed_dim_topic, input_length=X_topics_train.shape[1])(topic_input_layer) # topic embedding
# need to reshape
x = keras.layers.concatenate([lstm_reshape,topic_embed_layer],axis=2) # is this axis 1 or 2??
hidden_layer = keras.layers.Dense(hidden_nodes_final, activation='relu')(x)
#x = keras.layers.BatchNormalization()(x)
main_output = keras.layers.Dense(4, activation='softmax', name='main_output')(hidden_layer) # main output for categories
model = keras.models.Model(inputs=[lstm_input_layer,topic_input_layer], outputs=[main_output])
model.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model.summary()) # I guess I can use both
model.fit({'lstm_input': X_train, 'topic_input': X_topics_train }, # inputs
          {'main_output': Y_train}, # targets
          epochs = 50, # 
          batch_size=batch_size, verbose = 2,
          validation_data=([X_val, X_topics_val], Y_val), # validation data
          sample_weight = W_train) # sample-weights for unbalanced design

After compiling the model, the keras API returns the following list of layers and their shape.

______________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
lstm_input (InputLayer)         (None, 198)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 198, 194)     582000      lstm_input[0][0]                 
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 100)          118000      embedding_1[0][0]                
__________________________________________________________________________________________________
topic_input (InputLayer)        (None, 145)          0                                            
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 145, 100)     0           lstm_1[0][0]                     
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 145, 6)       870         topic_input[0][0]                
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 145, 106)     0           repeat_vector_1[0][0]            
                                                                 embedding_2[0][0]                
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 145, 122)     13054       concatenate_1[0][0]              
__________________________________________________________________________________________________
main_output (Dense)             (None, 145, 4)       492         dense_1[0][0]                    
==================================================================================================
Total params: 714,416
Trainable params: 714,416
Non-trainable params: 0
__________________________________________________________________________________________________
None

Figure 3: Diagram of the above Keras API model. The top arm is a generic text-classification model (word-tokens -> word embedding -> LSTM), while the bottom arm includes the "category embeddings". The outputs of the LSTM and the category-embeddings are concatenated before running through a final Dense layer. The final targets are multi-class labels (x-axis) and their conditional sentiments (NA,-,0,+) along the z-axis

According to the out-of-sample AUC statistics, the model performed quite well (>0.95). What is especially interesting is the learned relationships among the different categories, as visualized in figure 4 (below). The fact that related categories cluster together in space demonstrates that the model does indeed learning meaningful embeddings of the categories.

Fig 4: Consumer feedback categories and their learned position in a low-dimensional embedding space (here, visualized in 2D with t-SNE. 
There are a couple interesting things to note about the neural architecture:

  1. the 'generic' LSTM model is still present in this embedding model: it merely represents the chain of layers: `lstm_input_layer`, `lstm_embed_layer`, `lstm_output` (representing the word-embedding and the LSTM layers).
  2. the category-embedding goes through the chain: `topic_input_layer`,  `topic_embed_layer`
  3. in order to combine outputs of the LSTM chain with the outputs from the category-embedding chain, I had to repeat the outputs of the LSTM so that it had the same dimensions as the category-embedding in the X and Y axes (i.e., concatenating along the Z axis): `lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)`

Loss function:  Importance of the Multi-class loss


Notice also, that both the final layer (`hidden_layer`) and the Y targets are 3D dimensional. This has a very important implication for the `categorical_crossentropy` loss function: the unit-vector which sums to 1 is taken along the z-axis; in the model specification, this z-axis represents the 4 different sentiment categories per a particular category: [NA = not-present; negative; neutral; positive]. The columns are just different possible categories, each with their own sentiment-unit-vector heading off in the Z-axis direction. This is summarized in the figure 5. This has two important points:

  • First, the entire analysis is "multi-class" classification, such that each piece of text (a row in the matrix) can have multiple labels (which makes sense, because people can and do talk about different things in their customer feedback.
  • Second, the rows (different texts) and columns (different categories) are essentially treated as independent observations, according to the loss function!

Figure 5: structure of data-targets and the loss function. In Keras, if the targets is 3D (here, text on y-axis, categories on x-axis, and conditional sentiments on z-axis), and you choose a "categorical_crossentropy" loss function, then the distribution is assumed multinomial over the Z-axis. This means, the probability of each element in the z-axis sums to one (sentiments) while the columns and rows are all independent.


Therefore, we see why the embeddings may help the predictive performance of the model, and why the model is able to learn meaningful embeddings of the categories: were the model not to learn the associations among the categories (via the embeddings), then each category would contribute an independent amount to the loss function, irrespective of all the other categories, per sentence. With the embeddings, the model is learning a structure that helps it predict which columns/categories are related.

The specification of this loss function (such that each category is independent in the eyes of the loss function) is the crux of the embeddings usefulness. In the next section, I will do another seemingly similar analysis according to a different loss function, with very different results.

Example 2: Multinomial Classification

Due to the secretative nature of company I was working with, I wanted to show an alternative analysis using publicly available data for text-classification.  I settled on the USA Consumer Financial Protection Bureau's financial complaint database


No comments:

Post a Comment