*Some of this work is available as Python code at* https://github.com/faraway1nspace/NLP_topic_embeddings

## Topic Embeddings

One of the most interesting concepts to arise out of machine learning in the last decade is the "embedding" (see, for example, "word2vec", developed by Tomas Mikolov at Google). The general idea is to take categories that would otherwise be discrete, independent objects and "embed" them in a low-dimensional vector space. Each object/category is no longer its own discrete entity, but is a vector in some space. In the word2vec example, each English word is a vector of size 300, such that variation along the 300 dimensions carries inferred semantic meaning (e.g., "girl" vs. "boy" vary along a dimension, just as "queen" vs. "king" do). Another example is the famous Instacart model, where the analysts sought to embed 10 million grocery items into 10 dimensions. For my Insight Data project, I sought an embedding for customer-feedback topics from a proprietary customer survey dataset (e.g., to organize the various things customers complain about).

From the perspective of a data analyst, the immediate utility of the embedding approach is as an alternative means of vectorizing categorical variables and, in particular, of finding a vectorization that is low-dimensional. Consider the traditional route of vectorizing categorical data via "one-hot coding": each of the K categories becomes a binary dummy variable, giving a matrix with K-1 columns. In contrast, the embedding approach uses deep learning to vectorize each category into a continuous latent vector, whereby the embedding dimension D is typically much lower than the number of categories K (D < K). In my case, I was dealing with about 200 consumer-feedback categories (and growing exponentially!), while my embedding space had just 6 dimensions (see Fig 1 below).
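To make the contrast concrete, here is a minimal sketch of the two vectorizations, assuming K = 200 categories and D = 6 embedding dimensions as in my case. The embedding matrix is filled with random numbers purely for illustration (in a real model it would be learned during training), and the variable names are hypothetical.

```python
import numpy as np

K = 200  # number of categories
D = 6    # embedding dimensions

rng = np.random.default_rng(0)

# One-hot coding: each category is a sparse indicator vector of length K
# (or K-1 dummy columns if one reference category is dropped).
category_ids = np.array([3, 17, 199])      # three example category indices
one_hot = np.eye(K)[category_ids]          # shape (3, K)

# Embedding: each category is a dense row of a K x D matrix.
# Random values stand in for weights that a model would learn.
embedding_matrix = rng.normal(size=(K, D))
embedded = embedding_matrix[category_ids]  # shape (3, D)

# An embedding lookup is equivalent to multiplying the one-hot vector
# by the embedding matrix.
assert np.allclose(one_hot @ embedding_matrix, embedded)

print(one_hot.shape, embedded.shape)  # (3, 200) (3, 6)
```

The last check is the useful intuition: an embedding layer is just a one-hot coding followed by a learned linear projection into D dimensions.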

Fig 1: Example embedding: various consumer-complaint categories vectorized into 6 dimensions.

Why all the fuss? In each discipline, there is a flurry of creative uses for the embedding technology. For my use-case, I was interested in embeddings for two applications:

- finding redundant categories; and
- grouping categories into meaningful super-groups (clustering).
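As a rough illustration of the first application, the sketch below flags near-duplicate categories by cosine similarity. The embedding values are made up (3 dimensions for brevity), the category names are borrowed from later in this post, and the 0.95 threshold is an arbitrary assumption rather than a recommended value.

```python
import numpy as np

# Hypothetical category embeddings (made-up numbers, not learned values).
names = ["free samples", "sign-up incentives", "website layout", "shipping costs"]
emb = np.array([
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.00, 0.20, 0.95],
    [0.10, 0.90, 0.10],
])

# Cosine similarity between every pair of categories.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Flag pairs whose similarity exceeds the threshold as candidates for
# merging; a human would still need to review them.
threshold = 0.95
redundant = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if sim[i, j] > threshold
]
print(redundant)  # [('free samples', 'sign-up incentives')]
```

The same similarity matrix could also serve as input to a clustering routine for the second application, grouping categories into super-groups.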

I will walk through a particular problem that invites an embedding solution, and another that perhaps does not benefit from embeddings.

## Example 1: Multi-class Classification

The following example comes from a proprietary dataset, so the context and results are all general, and the code cannot be shared. Nonetheless, the results are interesting for two reasons: the use of embeddings, and the pivotal role of the multi-class loss.

### Goal

While studying ML & data science with Insight Data in 2018, I played consultant for a local startup company in Toronto. They had a large dataset of customer feedback in text form. Each row of text was a customer complaint or recommendation regarding a large variety of products, companies and shopping experiences. They employed an army of humans to go through each piece of feedback, tag it with category labels, and extract customer sentiments (-,0,+). The problem was that these categories/labels were increasing exponentially in time: the number of labels was growing from 50 to 200 (to perhaps 1000?) in a matter of months. How could I scale up this classification & sentiment analysis?

The scaling issue became very apparent after trying a generic 'go-to' model with a simple natural language processing (NLP) component and a deep-learning model. Using Python, TensorFlow, and the Keras API, the 'go-to' model had the following pipeline: pre-process the text (stem words, remove stopwords, etc.); vectorize the words of the text with a word-embedding (like word2vec, but trained within the context of this problem); run the word-vectors through a recurrent neural network (e.g., LSTM); finally, use Multi-class Classification (and conditional sentiment analysis) to assign category-labels to each customer-review.
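Since the original code cannot be shared, here is only a toy sketch of the first pipeline step (pre-processing and turning text into padded integer sequences that a word-embedding layer can consume). Everything in it is an assumption for illustration: the stopword list is tiny, `crude_stem` is a stand-in for a real stemmer, and the example reviews are invented.

```python
import re

STOPWORDS = {"the", "a", "an", "was", "and", "is", "to", "of", "it"}  # tiny illustrative list

def crude_stem(word):
    # Stand-in for a real stemmer (e.g., Porter): strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, tokenize, drop stopwords, then stem.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_vocab(token_lists):
    # Index 0 is reserved for padding, so word indices start at 1.
    vocab = {}
    for tokens in token_lists:
        for t in tokens:
            vocab.setdefault(t, len(vocab) + 1)
    return vocab

def vectorize(token_lists, vocab, max_len):
    # Keep only the last max_len tokens and left-pad with zeros to a fixed length.
    rows = []
    for tokens in token_lists:
        ids = [vocab[t] for t in tokens][-max_len:]
        rows.append([0] * (max_len - len(ids)) + ids)
    return rows

reviews = ["The shipping was slow and costs too much", "Loved the free samples"]
token_lists = [preprocess(r) for r in reviews]
vocab = build_vocab(token_lists)
X = vectorize(token_lists, vocab, max_len=6)

print(token_lists)  # [['shipp', 'slow', 'cost', 'too', 'much'], ['lov', 'free', 'sample']]
print(X)            # [[0, 1, 2, 3, 4, 5], [0, 0, 0, 6, 7, 8]]
```

The resulting integer matrix plays the role of the vectorized text that feeds the word-embedding and LSTM layers.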

### Problem

The go-to model performed well for the 20 most abundant categories, but it didn't scale well to the entire dataset of >200 categories. My worry was that the exponential increase in the number of categories meant that there was considerable redundancy in the ever-increasing quantity of (putative) categories.

### Solution: Embedding

Inspired by Instacart and the lessons at fast.ai, I thought it might be helpful to try to reduce the number of "effective" categories through embeddings. Basically, the algorithm learns an embedding space, different regions of the space have "meaning", and each category is merely a point in this space. Related categories like "free samples" and "sign-up incentives" are close together in this space, while categories like "website layout" and "shipping costs" occupy a different part of the space.

The neural architecture is shown in figure 3, and the Keras code looks like:

    import numpy as np
    import keras

    # INPUTS and OUTPUTS
    # X_train: the vectorized text data (training set)
    # Y_train: a 3D tensor of conditional targets:
    # ... : the rows represent observations
    # ... : the cols represent categories
    # ... : the slices represent the sentiments per category [NA,-,0,+]
    # X_topics_train : just a vector of integers 1:N_labels
    # W_train : sample-weights for unbalanced design
    # max_features, nTopics and Ymultclass are defined elsewhere (not shown)

    embed_dim_lstm = 194   # word-embedding dimensions
    lstm_OutputDim = 100   # output dimensions of the LSTM
    embed_dim_topic = 6    # category embedding dimensions
    batch_size = 128       # mini-batch size
    # dimensions of the final hidden layer (midpoint of the linspace)
    hidden_nodes_final = (np.linspace(lstm_OutputDim, Ymultclass.shape[1], 3).round().astype(int))[1]

    # LSTM chain: word indices -> word-embedding -> LSTM
    lstm_input_layer = keras.layers.Input(shape=(X_train.shape[1],), dtype='int32', name='lstm_input')  # lstm input
    lstm_embed_layer = keras.layers.Embedding(input_dim=max_features,  # input_dim = vocab size
                                              output_dim=embed_dim_lstm,
                                              input_length=X_train.shape[1])(lstm_input_layer)
    lstm_output = keras.layers.LSTM(lstm_OutputDim)(lstm_embed_layer)  # the output has dimension (None, 100)

    # need to repeat the LSTM output from a 2D matrix to a 3D tensor
    # in order to concatenate it to the category-embedding
    lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)

    # category chain: category indices -> category-embedding
    topic_input_layer = keras.layers.Input(shape=(X_topics_train.shape[1],), dtype='int32', name='topic_input')  # input the topics
    topic_embed_layer = keras.layers.Embedding(input_dim=nTopics,
                                               output_dim=embed_dim_topic,
                                               input_length=X_topics_train.shape[1])(topic_input_layer)  # topic embedding

    # concatenate along the last (feature) axis: axis=2
    x = keras.layers.concatenate([lstm_reshape, topic_embed_layer], axis=2)
    hidden_layer = keras.layers.Dense(hidden_nodes_final, activation='relu')(x)
    #x = keras.layers.BatchNormalization()(x)
    main_output = keras.layers.Dense(4, activation='softmax', name='main_output')(hidden_layer)  # main output for categories

    model = keras.models.Model(inputs=[lstm_input_layer, topic_input_layer], outputs=[main_output])
    model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])
    print(model.summary())

    # training inputs/targets are passed as dicts keyed by layer name;
    # validation data is passed as lists in the same order
    model.fit({'lstm_input': X_train, 'topic_input': X_topics_train},  # inputs
              {'main_output': Y_train},                                # targets
              epochs=50,
              batch_size=batch_size,
              verbose=2,
              validation_data=([X_val, X_topics_val], Y_val),          # validation data
              sample_weight=W_train)                                   # sample-weights for unbalanced design

After compiling the model, the keras API returns the following list of layers and their shape.

    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to
    ==================================================================================================
    lstm_input (InputLayer)         (None, 198)          0
    __________________________________________________________________________________________________
    embedding_1 (Embedding)         (None, 198, 194)     582000      lstm_input[0][0]
    __________________________________________________________________________________________________
    lstm_1 (LSTM)                   (None, 100)          118000      embedding_1[0][0]
    __________________________________________________________________________________________________
    topic_input (InputLayer)        (None, 145)          0
    __________________________________________________________________________________________________
    repeat_vector_1 (RepeatVector)  (None, 145, 100)     0           lstm_1[0][0]
    __________________________________________________________________________________________________
    embedding_2 (Embedding)         (None, 145, 6)       870         topic_input[0][0]
    __________________________________________________________________________________________________
    concatenate_1 (Concatenate)     (None, 145, 106)     0           repeat_vector_1[0][0]
                                                                     embedding_2[0][0]
    __________________________________________________________________________________________________
    dense_1 (Dense)                 (None, 145, 122)     13054       concatenate_1[0][0]
    __________________________________________________________________________________________________
    main_output (Dense)             (None, 145, 4)       492         dense_1[0][0]
    ==================================================================================================
    Total params: 714,416
    Trainable params: 714,416
    Non-trainable params: 0
    __________________________________________________________________________________________________
    None

According to the out-of-sample AUC statistics, the model performed quite well (>0.95). What is especially interesting is the learned relationships among the different categories, as visualized in figure 4 (below). The fact that related categories cluster together in space demonstrates that the model does indeed learn meaningful embeddings of the categories.

Fig 4: Consumer feedback categories and their learned position in a low-dimensional embedding space (here, visualized in 2D with t-SNE).

Some notes on the architecture:

- The 'generic' LSTM model is still present in this embedding model: it is simply the chain of layers `lstm_input_layer`, `lstm_embed_layer`, `lstm_output` (representing the word-embedding and the LSTM layers).
- The category-embedding goes through the chain: `topic_input_layer`, `topic_embed_layer`.
- In order to combine the outputs of the LSTM chain with the outputs of the category-embedding chain, I had to repeat the outputs of the LSTM so that they had the same dimensions as the category-embedding in the X and Y axes (i.e., concatenating along the Z axis): `lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)`
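The shape bookkeeping behind the repeat-and-concatenate step can be checked with plain NumPy. This is a sketch that mimics what `RepeatVector` and `concatenate(axis=2)` do to the tensor shapes, not the Keras implementation. The batch size of 2, the random values, and the 0-based category indices are assumptions for illustration; the sizes 145, 100 and 6 come from the model summary above.

```python
import numpy as np

batch, n_topics = 2, 145
lstm_dim, topic_dim = 100, 6

rng = np.random.default_rng(0)
lstm_output = rng.normal(size=(batch, lstm_dim))       # one vector per text
topic_table = rng.normal(size=(n_topics, topic_dim))   # one vector per category (stand-in for learned weights)

# RepeatVector(n_topics): copy each text vector once per category.
lstm_repeated = np.repeat(lstm_output[:, None, :], n_topics, axis=1)  # (batch, 145, 100)

# Embedding lookup: every text is paired with the full list of category indices.
topic_ids = np.tile(np.arange(n_topics), (batch, 1))                  # (batch, 145)
topic_embedded = topic_table[topic_ids]                               # (batch, 145, 6)

# Concatenate along the last (feature) axis, i.e. axis=2.
x = np.concatenate([lstm_repeated, topic_embedded], axis=2)           # (batch, 145, 106)
print(x.shape)  # (2, 145, 106)
```

Each of the 145 rows for a given text therefore holds the same 100 LSTM features followed by the 6 features of one category, which is what lets the subsequent dense layers score every (text, category) pair.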

### Loss function: Importance of the Multi-class loss

Notice also that both the output of the final layer (`main_output`) and the Y targets are 3D tensors. This has a very important implication for the `categorical_crossentropy` loss function: the probability vector that sums to 1 is taken along the z-axis, and in the model specification this z-axis represents the 4 possible sentiment classes for a particular category: [NA = not-present; negative; neutral; positive]. The columns are just the different possible categories, each with its own sentiment probability vector heading off in the z-axis direction. This is summarized in figure 5. There are two important consequences:

- First, the entire analysis is "multi-class" classification, such that each piece of text (a row in the matrix) can have multiple labels (what is often called multi-label classification). This makes sense, because people can and do talk about different things in their customer feedback.
- Second, the rows (different texts) and columns (different categories) are essentially treated as **independent** observations, according to the loss function!
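The independence of the (text, category) cells can be illustrated with a simplified NumPy re-implementation of categorical cross-entropy taken over the last axis. This is a sketch for intuition, not the Keras source, and the toy targets and predictions are invented.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Cross-entropy over the last axis; returns one loss per (text, category) cell.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Toy example: 2 texts, 3 categories, 4 sentiment classes [NA, -, 0, +].
y_true = np.zeros((2, 3, 4))
y_true[0, 0, 1] = 1   # text 0, category 0: negative
y_true[0, 1, 0] = 1   # text 0, category 1: not present
y_true[0, 2, 3] = 1   # text 0, category 2: positive
y_true[1, :, 0] = 1   # text 1: no category present

# Uniform predictions: every sentiment class gets probability 0.25.
y_pred = np.full((2, 3, 4), 0.25)
cell_loss = categorical_crossentropy(y_true, y_pred)   # shape (2, 3)
total = cell_loss.mean()
print(cell_loss.shape, round(float(total), 4))  # (2, 3) 1.3863

# Improving the prediction for one cell changes only that cell's loss.
y_pred2 = y_pred.copy()
y_pred2[0, 0] = [0.1, 0.7, 0.1, 0.1]
cell_loss2 = categorical_crossentropy(y_true, y_pred2)
changed = cell_loss2 != cell_loss
print(changed.sum())  # 1
```

The total loss is just an average of per-cell terms, so nothing in the loss itself ties one category to another; any sharing of information has to come from the network's parameters.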

Therefore, we see **why** the embeddings may help the predictive performance of the model, and why the model is able to learn meaningful embeddings of the categories: *were the model not to learn the associations among the categories (via the embeddings), then each category would contribute an independent amount to the loss function, irrespective of all the other categories, per sentence.* With the embeddings, the model is learning a structure that helps it predict which columns/categories are related.

The specification of this loss function (such that each category is independent in the eyes of the loss function) is the crux of the embeddings' usefulness. In the next section, I will do another seemingly similar analysis with a different loss function, with very different results.