Thursday, 25 October 2018

Boosting Deep Dive: from MLEs to Hypothesis Testing

I wanted to write a small tutorial outlining some of the key concepts of "boosting". Boosting is a broad topic that is related to many other methods (e.g., Random Forest, SVM, bagging, Lasso), in addition to have many flavours (adaboost, component-wise boosting, boosted regression trees, twinboosting). In this tutorial, we'll walk through the basics of boosting and visualize some of its key statistical properties. Eventually, we'll build-up to the powerful XGBoost (winner of so many kaggle competitions). We'll show how this powerful learner comes from a simple algorithm which encompasses not only XGBoost, but also l2- and l1-regularzation.

Part of the confusing nature of boosting is that its core concepts have arisen from different disciplines. The majority come from workers in machine learning and learning theory. Later, statisticians united the myriad methods under a statistical framework. For more on this history, I recommend reading Buhlmann and Hothorn or Mayr and Ratch for the modern synthesis across methods. Boosting is very general and comprehensive.

Let's start off with a definition of boosting: "Boosting is a statistical modelling technique which uses functional gradient descent to minimize a cost function while sequentially building an ensemble of shrunken weak learners".

Key Concepts

I want the audience to understand these four key concepts:

  • how Functional Gradient Descent works
  • what are Weak Learners and why are they crucial for boosting
  • what is an Ensemble of weak learners
  • what is shrinkage

These concepts separate boosting from similar methods. Later in the blog, I'll discuss other important concepts like l1-regularization, l2-regularization, and hypothesis testing / l0-regularization.

Key Notation

  • X: A matrix of covariate/independent effects
  • Y: A response variable (here, normally distributed)
  • F: "fit-vector" this is our predicted value of y (on the working scale)
  • g(y,x): a learner; tries to predict Y from X: g(X)->Y
  • r: negative-gradient
  • nu: learning rate, less than 1 

Weak Learners

Weak Learners produce a classifier that is slightly better than random guessing (> 50%). The foundations of boosting arise from the insight that weak learners can be iteratively improved (boosted) to produce an overall strong learner. Schapire and Freund developed the first boosting algorithms, such as Adaboost.

Example with trees:
A weak learner may be a CART-model with a low maximum tree-depth (figure below, right). It is biased, but low variance (i.e, it doesn't vary much with a new sample of data). A strong CART learner would have a higher maximum tree depth (figure below, left), given it more flexibility and yielding less-biased estimates. But, its estimates would vary more with a new sample of data.

CART Learner: strong vs. weak: low-bias high-variance (left), high-bias low-variance (right)

Example with linear-learners (least squares learners, splines, etc).
In it's normal operations, a least-squares learner provides the maximum-likelihood solution to a linear regression between covariates X and continuous response Y (figure below, top). Least-Squares learners can also be weak or strong. For example, a weak version of a least-squares learner would be penalized least-squares (aka, ridge-regression). A strong ridge-penalty decreases the learner's flexibility, complexity, and degrees-of-freedom: it is more constrained and must discriminate among its potential covariates. Like the week CART learner, it is also biased and low variance. In contrast, an unpenalized least-squares learner will just do vanilla linear-regression: certainly a powerful solution that has been the backbone of data analysis for over 100 years.

Least Squares Learner: strong (unpenalized) vs weak (penalized)
Why does weakness matter?
The "weak learner" is a key feature of boosting. Such weak learners separate boosting from the random forest algorithm. In random forest, one creates an ensemble of strong-learners: each is a complex, high-variance tree-learner with a deep maximum tree-depth. In contrast, in boosted regression trees, one uses trees with relatively low max-depth. Other types of learners are possible, like splines, radial basis functions, least-squares: so long as we have some way to heavily penalize complexity and keep them weak.

Why the emphasis on weak learners? Weak Learners must make the best of a bad situation: they must "choose" one or a few of the features, within the bounds of their contraints, whereas strong learners can through everything at a problem.

More formally, the weakness of learners is one of the ways in which boosting is able to negotiate the famous "bias-variance trade-off": at low boosting iterations, the model is simple and biased; at greater boosting iterations, the model is more complex, less-biased, and higher variance. Negotiating the number of boosting iterations prevents overfitting. If we use cross-validation to "tune" the number of boosting iterations, then Buhlmann and Yu showed how boosting has an "adaptive complexity" at the level of the entire ensemble: it can be either low complexity or high complexity, governed by the number of boosting iterations. This can be seen in the animation below, where the fitted spline has maximum degrees-of-freedom 4 (like 4th order polynomial), while the data is clearly 5th or 6th order (it is a sin curve). Nonetheless, by combining twenty 4th-order splines through boosting, the final model clearly can approximate the higher-order data. Weak learners are boosted to yield a strong, complex ensemble-learner.

Functional Gradient Descent

You've most likely heard of "gradient descent". Figure XXX shows how a parameter (beta) can be iteratively updated by incrementing its value according to the gradient of the loss function. Note that it is the parameter that is updated. Let's contrast this with FGD.

Gradient Descent vs. Functional Gradient Descent

In FGD, we literally fit a learner to the negative gradient (called g). We are iteratively updating the entire "fit-vector" (our best estimate of Y). This is an iterative process, such that as we descend the gradient of the loss function, we generate many fitted-learners, each of which has been fitted at different slopes along the gradient descent. The full algorithm is shown above and in the code below.

Why are we iteratively fitting the negative gradient? An intuitive explanation comes from the early boosting algorithm Adaboost. In that classification algorithm, one sequentially fits the parts of the data that are difficult to classify in earlier iterations. In other words, the algorithm focuses on those cases with very high residuals (lots of unexplained variance).

Functional Gradient Descent: (top) the current residuals at boosting iteration m, the purple learner is fit to the residuals, and the orange line is the shrunk versions, shrunk by the learning rate; (bottom) the ensemble learner is just the sum of all the previously fit learners.

With a Gaussian loss function, the negative gradient is just the residuals: y-f. Thus, by fitting the negative gradient, we are likewise forcing our boosting to first do-away with the easy cases (small residuals), and then sequentially focus on the tougher cases (bigger residuals). This notion of "adaptive residuals" is generalized for other non-Gaussian loss functions by using the negative gradients.

# Gaussian Loss
(y - f)^2
# Gaussian negative-gradient
y - f

#Binomial Loss:
p <- exp(f)/(exp(f) + exp(-f))
-y * log(p) - (1 - y) * log(1 - p)

#Binomial negative gradient:
exp2yf <- exp(-2 * y * f)
-(-2 * y * exp2yf)/(log(2) * (1 + exp2yf))

# Negative Binomial Loss
-(lgamma(y + exp(f)) - lgamma(exp(f)) - lgamma(y + 1) + exp(f) *  
     f + y * log(mu) - (y + exp(f)) * log(mu + exp(f))) 

# Negative Binomial negative-gradient
mu:  y - (y + sigma)/(exp(f) + sigma) * exp(f)
sigma:- exp(f) * (digamma(y + exp(f)) - digamma(exp(f)) +
                 log(exp(f)) + 1 - log(mu + exp(f)) -
                 (exp(f) + y)/(mu + exp(f)))

# see ?Family from package mboost for more

Ensemble Learning

If we ran the boosting algorithm for 20 iterations, then we would have 20 fitted learners. This forms an ensemble of leaners. The final model is the aggregate of sum all the learners. In order to make predictions on new data, each learner would have to make a prediction, and those predictions are summed (element-wise). The aggregation is shrunk by a learning rate. The learning rate is small, so that the outputs from each weak learner is small.

The way we aggregate the predictions of the ensemble of learners is different from Bagging methods (like Random Forest) which aggregate the learners in some other way (like majority voting).


A key part of boosting is that we "stop short" the algorithm. The number of iterations is determined by, e.g., cross-validation or the AIC. For least-squares learners, if we ran the algorihtm until infinity, we would eventually get the Maximum Likelihood Estimate of the parameters. For some smaller number of iterations, the estimated parameters will be shrunk towards zero. Copas 1983 discovered the importance of shrinkage for prediction.

The shrinkage phenomenon is strange. It is related to the James-Stein effect for estimating the means of a multivariate Gaussian. Copas then noticed a similar effect for GLM models. Boosted models are a type of shrinkage estimator.

The shrinkage manifests as a slight bias-towards-zero (away from the MLE). We now understand that this is a way of negotiating the classic Bias-Variance Trade-off. For more on the Bias-Variance trade-off and shrinkage, see my blog post at called "A primer on the Bias-Variance Trade-off".

Many non-statistician researchers are shocked to see such obvious bias. But it is strategic: we are incurring some bias so to reduce the variance and reduce the overall Expected Risk (the Expected Mean Square Error). Having a lower Expected Risk means that we are more likely to be closer to the truth even if we are systematically biased. If you don't grok this idea of the bias-variance trade-off, then Copas (1997) offers an intuitive explanation: "One of the classic examples of regression to the mean is the inheritance of height. Tall fathers tend to have shorter sons, and short fathers tend to have taller sons. Here y1 (father's height) and y2 (son's adult height) are two random variables with (almost exactly) the same distribution, but the conditional expectation E[y2|y1] is not y1 but a value closer to the overall mean."

PART 1: Example of Functional Gradient Descent I: Shrinkage Estimator

I will now show FGD in action by manually performing FGD with just one Least Squares learner (e.g,. a multiple regression learner). The dataset is the famous diabetes data (442 observations, 10 covariates) Efron et al "Least Angle Regression" paper. Our learner will use all 10 covariates: age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu

Let's first load the data and calculate the MLEs

library(MASS) # get the lm.ridge function
library(lars) # get the LASSO/LAR functions
data(diabetes) # data set to analyze

# response variable: y
y <- diabetes$y

# explanatory covariates
x <- scale(diabetes$x) # standardize the covariates
xdata <- setNames($x),nrow(diabetes$x),ncol(diabetes$x))),nm=colnames(x))

# nummber of observations
n <- nrow(xdata)

# number of covariates
K <- ncol(xdata)

# show the maximum likelihood estimates
coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata))))

MLEs of the diabetes data:

> coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata))))
(Intercept)         age         sex         bmi         map          tc         ldl         hdl         tch         ltg         glu 
  152.13348   -10.01220  -239.81909   519.83979   324.39043  -792.18416   476.74584   101.04457   177.06418   751.27932    67.62539 

Now let's define our learner: it does very simple thing, just returns a function ("fit") that calculates the least-squares solution for a given Y.

# Learning: Least Squares:
LeastSquaresLearner <- function(X, penalty = 10^-7){
    # X: input the covariates (X); 
    # penalty: possible ridge penalty
    # add intercept to the data: X
    X <- as.matrix(cbind(intercept = rep(1,nrow(X)),X))
    # (optional) add ridge penalty as a diagonal matrix
    Omega = diag(ncol(X))*penalty

    # X'X  + Ridge Penalty
    XtX <- crossprod(X, X)+Omega

    # Fitting function: get least-squares solution
    fit <- function(y){

        # get betas: inv(Xt*X+Omega) * Xt * y 
        beta_hat = solve(XtX, crossprod(X, y)) # inv(Xt*X+Omega)*Xt * y 
        beta_hat = as.numeric(beta_hat)

        # prediction function
        predict <- function() X%*%beta_hat 

        # return fitted learner
                beta = beta_hat, # LS solution
                predict = predict # provides the fit vector
    # outer return 
    return(list(fit = fit))

# try the learner
l1 = LeastSquaresLearner(xdata)
l1_fit = l1$fit(y)
f = l1_fit$predict()

Finally, we'll perform Functional Gradient Descent. A each iteration we: update the residuals/negative-gradient (r) by taking the difference between the fit-vector (F) and the response variable (Y); then fit the learner to the negative-gradient (g), increase our fit-vector by the outputs from the learner, shrunk by the learning-rate (nu).

Key features:

  • the learner is NOT weak, it is an unpenalized multi-regressor.
  • there is only ONE learner: the learner performs multiple-regression.
  • this is for illustrative purposes only. No one would do this in reality (there are much better estimators)
  • boosting hyperparameters: the number of boosting steps is and the learning-rate.

# PART 1 : SHRINKAGE (Copas 1983)

# let's compare to functional gradient descent!
mstop = 1200 # number of iterations
nu= 0.05 # learning rate 
beta_FGD <- matrix(0,mstop, ncol(xdata))

# intialize the learner
ls_learner = LeastSquaresLearner(xdata)

# initialize the model: model of the mean
y_mean <- mean(y)
F <- rep(y_mean,n)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# boosting
for(m in 1:mstop){

    # fit the learner to the residual
    g <- ls_learner$fit(r)

    # update the Model
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y, F)

    # store: empirical risk
    vRisk[m] <- risk(y,F)
    # store the coeffcients (minus the intercept)
    beta_FGD[m,] <- nu*g$beta[-1] 


# Betas: per boosting step
beta_boost <- apply(beta_FGD, 2,cumsum)
# Beta: maximum likelihood estimation
beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata))))

# plot the regularization pathways
plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways")

# plot each coefficient
for(k in 1:ncol(beta_FGD)){
    lines(x=log(1:mstop), y=beta_boost[,k], col=k)

# add MLEs
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

Regularization Pathways: as we go to the left (more steps in the gradient descent) the regression coefficients increase until they reach the Maximum Likelihood Estimates (right). 

Key Takeaways (Shrinkage Estimator)

  • the FGB eventually produces the Maximum Likelihood Estimates as the number of boosting iterations gets very large.
  • at smaller boosting iterations, the betas are shrunk towards zero but none are exactly zero.
  • this is most similar to the "pre-shrunk" estimators from Copas 1983 (no one uses these anymore).
  • there is no feature selection at lower m: all betas are shrunk-towards-zero by a constant from their MLE's (i.e, early stopping doesn't favour one covariate vs. another). The strong learner makes use of every single feature at every iteration.

PART 2: Example of Functional Gradient Descent II: Ridge Regression (l2-regularization)

In the next example, we will use the same learner from above, but the learner will be penalized. We will apply a strong ridge penalty that reduces the learners degrees of freedom, i.e., the learner will have to start "choosing" among features within the bounds of its constraints.

Key Features:

  • there is only ONE learner: the least-square learner performs penalized multiple-regression.
  • the penalized learner is weak compared to the previous section.
  • hyperparameters: mstop=7000; learning-rate nu=0.5; the learner-penalty of 7


mstop = 7000 # number of iterations
nu= 0.5 # learning rate 
beta_FGD <- matrix(0,mstop, ncol(xdata))

# intialize the learner
ls_learner = LeastSquaresLearner(xdata,penalty=7)

# initialize the Fit-vector
y_mean <- mean(y)
F <- rep(y_mean,n)
# goal: get F close to Y (minimizes loss)

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# boosting
for(m in 1:mstop){

    # fit the learner to the residual
    g <- ls_learner$fit(r)

    # update the Fit Vector
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y, F)

    # estimate the loss
    vRisk[m] <- risk(y,F)
    # store the coeffcients (minus the intercept)
    beta_FGD[m,] <- nu*g$beta[-1] 


# Betas per boosting step
beta_boost <- apply(beta_FGD, 2,cumsum)
# Beta: maximum likelihood estimation
beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata))))

# plot the regularization pathway
plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways")

# plot each coefficient
for(k in 1:ncol(beta_FGD)){

    lines(x=log(1:mstop), y=beta_boost[,k], col=k)

# add MLEs
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

# Compare to ridge regression
penalty_seq <- exp(seq(log(10000),log(0.2),length.out = 50))  # 0.01 to 10

# fit ridge regr for each value of the penalty_seq
lridge <-  lm.ridge(y ~ age + sex + bmi + map + tc + ldl + hdl + tch + ltg + glu, data = data.frame(cbind(y=y, xdata)),lambda = penalty_seq)

# all ridge solutions
beta_ridge <- coef(lridge)[,-1] # -1 ignores the intercept

# make a plot
plot(rev(range(log(penalty_seq))), range(beta_ridge),type="n", xlab = "ridge penalty", ylab = expression(hat(beta)), xaxt = "n", main = "Ridge: Regularization Pathways")
axis(1,at=rev(log(penalty_seq)), labels=sprintf("%0.5s",penalty_seq),las=2)
# plot each coefficient
for(k in 1:ncol(beta_ridge)){

    lines(x=rev(log(penalty_seq)), y=beta_ridge[,k], col=k)

# add MLEs
points(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4)
text(max(log(penalty_seq)), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

For comparison, I have shown the "regularization pathways" of a Ridge Regression solution. From left to right, we are weakening the regularization of the Ridge Regression (lambda hyperparameter), until eventually there is no penalty and the solutions are the MLEs.

Notice any similarity between the regularization pathways of the boosted penalized least-squares learner and the ridge-regression?

Regularization Pathways for l2-regularizer: left is stronger regularization, right is weaker regularization

Key Takeaways.

  • boosting can approximate the solutions of l2-regularization (Ridge Regression).
  • a penalized/weak learner is forced to discriminate among features, which begets some feature learning. You can see this in the above regularization pathways in which the betas have different relative magnitudes at different steps (e.g, "tc" is disfavoured until about exp(6)=403 boosting iterations, after which its beta value explodes in magnitude).
  • at smaller boosting iterations, the betas are shrunk towards zero but none are exactly zero
  • the FGB obtains the MLEs as the number of boosting iterations goes to infinity

NOTE The optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.

PART 3: Example of Functional Gradient Descent III: Component-wise Boosting

In the next example, we will use a different learner per covariate. In Examples I and II, there was one least-squares learner, which performed multiple regression using all the covariates. Our next algorithm, which uses a different learner per covariate, is called component-wise boosting. The key is that for each boosting iteration,  only one learner/covariate is picked to enter the ensemble.

Something magical happens with component-wise boosting :)

Key Features:

  • there is one learner per covariate.
  • each learner competes to fit the gradient and add it's fit to the overall fit-vector F.
  • boosting hyperparameters: the number of boosting steps is mstop=5000 and the learning-rate is 0.2; the learners are unpenalized

mstop = 5000 # number of iterations
nu= 0.2 # learning rate

# store the coefficients per boosting step
beta_FGD <- matrix(0,mstop, ncol(xdata))

# store the learners
ensemble <- vector(mode="list",length = mstop)

# intialize the learners
# notice there are 10 learners: one learner per covariate
ls_learners = setNames(vector(mode = "list", length = K),nm=colnames(xdata))
for(k in 1:K){
    ls_learners[[k]] <- LeastSquaresLearner(xdata[,k,drop=FALSE])
# initialize the fit-vector (F) current estimate of y
y_mean <- mean(y)
F <- rep(y_mean,n) # Fit-vector
# goal: get F close to Y

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# dummy list: temporary storge of learners
g.list <- setNames(vector(mode = "list", length = K),nm=colnames(xdata))

# boosting
for(m in 1:mstop){

    # fit EACH learner to the residual
    g_SS <- numeric(K)
    for(k in 1:length(ls_learners)){
        # fit each learner
        g.list[[k]] <- ls_learners[[k]]$fit(r)

        # calculate GOF statistic per learner
        g_SS[k] <- sum((g.list[[k]]$predict()-r)^2)

    # get best learner
    best_learner <- which.min(g_SS)
    g <- g.list[[best_learner]]

    # update the Fit Vector 
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y,F)

    # store the empirical risk
    vRisk[m] <- risk(y,F)
    # store the coeffcients (minus the intercept)
    beta_FGD[m,best_learner] <- nu*g$beta[-1] 

    # add learner to the ensemble (optional)
    ensemble[[m]] <- g

# add up all the coefficients per learner
beta_boost <- apply(beta_FGD, 2,cumsum)

plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)), ,main = "Functional Gradient Descent:\n Regularization Pathways")
# plot each coefficient
for(k in 1:ncol(beta_FGD)){

    lines(x=log(1:mstop), y=beta_boost[,k], col=k)
# ADD mles
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

# LASSO: estimate the lasso betas
lasso <- glmnet(x=as.matrix(xdata),y = y, alpha=1,lambda=exp(seq(log(45),log(0.025),length.out=1000)))
beta_lasso <- t(lasso$beta)
vLambda <- lasso$lambda

# plot the lasso betas 
rangebeta = range(beta_lasso)
nsteps = nrow(beta_lasso)
plot(c(1,nsteps), rangebeta,type="n", xlab = "LASSO regularization steps",ylab = expression(hat(beta)))
for(k in 1:ncol(beta_lasso)){
    lines(1:nsteps, beta_lasso[,k], col = k)
points(rep(nsteps,K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(nsteps,K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(nsteps, max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

For comparison, I have shown the "regularization pathways" of Lasso Regression. From left to right, we are weakening the regularization on the Lasso (lambda hyperparameter), until we get to the far-right solution where we basically obtain the MLEs. Towards the left (stronger regularization), notice that only a few of the covariates have every been picked and have non-zero coefficients. In other words, at low iterations, the Lasso provides a sparse solution: some of the less-important covariates have coefficients that are exactly zero. This is a type of feature selection: the algorithm is learning to pick and chose among covariates, shrinking some to absolute zero.

Notice any similarity between the regularization pathways of the component-wise boosting and the Lasso? In general, they are similar but not exact. We say that they are approximately the same, and that component-wise boosting provides a type of l1-regularization (Hastie,Tibshirani,Friedman 2001).

l1-Regularization: left is stronger regularization, right is stronger regularization

Key Takeaways.

  • boosting can approximate the solutions of l1-regularization (Lasso regression).
  • component-wise boosting and l1-regularization provide automatic feature selection.
  • the FGB obtains the MLE as the number of boosting iterations gets large.
  • at smaller boosting iterations, some betas are shrunk towards zero (i.e., a sparse solution).
NOTE The optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.

The benefit of using component-wise boosting over a LASSO model is that we can have very complex combinations of learners. For example, in the R package mboost, we might like to have some interactions between sex and the other covariates.

Why all the excitement for l1-regularization?

A key feature of Lasso-like solutions is a property called minimax optimal rate of convergence in high-dimensional regression. In English, this means that our worst-case theoretical error is minimized as fast as possible with growing n. Among frequentists, minimax-optimality is just one of many desirable properties of an estimator: it provides a theoretical guarantee that in the worst-case (high error), we'll still beat all the other losers.

Furthermore, l1-regularization also often has the lowest Prediction Errors (e.g., Expected Errors). Let's demonstrate this through cross-validation, comparing the Shrinkage model, the l2-regularization model, component-wise boosting model, and of course Maximum Likelihood Estimates (MLE)

We will run all these models using the R package mboost.  The least-squares learners are exactly the same as in the previous examples. See ?bols for an in-depth description of how mboost's formula and learner specification.

# Cross-Validation: compare four estimators in terms of prediction error
# Models: MLEs, Shrinkage, Ridge, and Component-wise boosting

library(mboost) # boosting library

# join Y and X into a single dataframe
alldata <- data.frame(cbind(y=y, xdata)) 

# make the cross-validation weights
cv10f <- cv(weights = rep(1,n), type = "kfold")

# re-run each model in mboost: Shrinkage, l2/Ridge, l2/Lasso, MLE
# boosting with one unpenalized learner (not-boosting technically, just a Copas shrinkage model)
m.shrinkage <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu), data = alldata,control=boost_control(mstop=10000,nu=0.2))
# boosting with one penalized learner (l2-regularization) 
m.l2 <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu,lambda=7), data = alldata,control=boost_control(mstop=10000,nu=0.2))
# component-wise boosting 
m.l1 <- mboost(y~bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu), data = alldata,control=boost_control(mstop=10000,nu=0.2))
m.MLE = glm(y~.,data=alldata)

# do 10-fold cross-validation on each model
cvfit.shrink <- cvrisk(m.shrinkage, folds = cv10f, papply = mclapply,mc.cores=5) # copas model (not-boosting)
cvfit.12 <- cvrisk(m.l2, folds = cv10f, papply = mclapply,mc.cores=5) # l2 (ridge)
cvfit.11 <- cvrisk(m.l1, folds = cv10f, papply = mclapply,mc.cores=5) # l1 (component-wise)
cvfit.MLE <- lapply(1:10,function(k){
    # insample weights
    # run cv glm
    # loss function (y-f)^2

# Expected Error (according to optimal boosting step per model)
cvError = c(
    "MLE" = mean(unlist(cvfit.MLE)),
    "Shrinkage Model (Copas)" = min(colMeans(cvfit.shrink)),
    "l2 (Ridge)" =  min(colMeans(cvfit.12)),
    "l1 (Component-wise)" =  min(colMeans(cvfit.11))

# plot the CV-errors
barplot(cvError,ylim=range(pretty(cvError)),las=2,ylab="Expected Error (10-fold CV)", main = "Component-wise boosting wins, with lowest predictive error\n (\u21131-regularization) ",xpd=FALSE,col="pink")

Comparison of prediction error from cross-validation, according to four different types of estimators. Lower prediction error is better.

Why this order of predictive performance? From worse to best:

  1. Maximum likelihood model has no shrinkage, no feature learning
  2. The shrinkage model has shrinkage, but no feature learning
  3. The l2 model has shrinkage, and some weak feature learner
  4. The l1 model (component-wise boosting) has shrinkage and does automatic feature learning, successfully setting some unimportant covariates to zero

PART 4: Example of Functional Gradient Descent IV: Boosted Regression Trees.

In the next example, our learner will be a single (weak) decision trees from the package rpart (which is a CART model). Despite this wildly different learner, the FGD algorithm is the same: we sequentially fit the learner to the negative gradient (residuals1). The only thing that has changed is the learner: a CART model with small maximum depth vs penalized least-squares.

We will compare this boosted regression tree model to the FAMOUS and POWERFUL XGBoost! (go-to solution for many Kaggle competitions, and rightly so). We start by defining our learner:

TreeLearner = function(X, control = ctree_control(maxdepth=2,mtry=round(0.7 * ncol(X)),mincriterion=0), subsample = 1){

    mf <-
    n <- nrow(mf)
    if(subsample <1){
        get_weights <- function() tabulate(sample(1:n,size=round(n*subsample)),n)
    } else {
        get_weights <- function() rep(1,n) 
    # fit a PARTY model (conditional inference tree)
    fit = function(y){
        # fit a tree model
        btreemodel <-ctree(y ~ ., data = data.frame(cbind(y=y,mf)), weights = get_weights(), controls=control)
        # return the predictions
        predict <- function(newdata=NULL) btreemodel@predict_response(newdata)
            predict = predict
    return(list(fit = fit))

Key Features:

  • one CART learner, with hyperparameters: maxdepth (2), minbucket (7), subsampling rate 0.66
  • for each boosting iteration, the learner gets a random subsample of rows of data (like bagging!)
  • for each boosting iteration, the learner gets a random subsample of columns of features (like random forest!)
  • boosting hyperparameters: the number of boosting steps is 200 and the learning-rate is 0.02.

# initialize the fit-vector
y_mean <- mean(y)
F <- rep(y_mean,n)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# initialize the learner
learners <- TreeLearner(xdata,subsample=0.66)

# initialize the ensemble
mstop <- 200
nu <- 0.02
ensemble <- vector(mode = "list", length = mstop)

# tree boosting
for(m in 1:mstop){

    g <- learners$fit(r)
    # update the Fit-vector
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y,F)

    # add the learner to the ensemble
    ensemble[[m]] <- g


# fit an XGBoost model
xgmod <- xgboost(data = as.matrix(xdata),
                 label = y,
                 max_depth = 2, # maximum tree depth
                 eta = nu, # shrinkage rate
                 nthread = 2,
                 nrounds = mstop,
                 objective = "reg:linear",
                 colsample_bytree = 0.7,
                 min_child_weight = 7 # weights needed in each child to allows splitting

# make marginal plots: compare XGBOost and our manual boosting
marginal_data <-,2,function(x){rep(median(x),50)}))

for(k in names(xdata)){

    # set all other convariates to their mean (0) (other than k)
    marginal_tmp <- marginal_data
    # predict over the range of the target covariate
    marginal_tmp[,k] <- seq(min(xdata[,k]),max(xdata[,k]),length.out = 50)

    # get my predictions    
    f <- rep(y_mean,50)
    for(g in 1:length(ensemble)){
        # update f
        f <- f + nu*ensemble[[g]]$predict(newdata=marginal_tmp)

    # get predictions from XGboost
    # plot my boosting function
    plot(marginal_tmp[,k], f, type="l",main = k, col = "red",ylim = range(f,f.xgboost), lwd=2)
    # add line for xgboost (blue)
    lines(marginal_tmp[,k], f.xgboost, col ="blue", lwd=2)

    # legend
    legend(x="bottomright",legend=c("xgboost","manual"), col=c("blue","red"),lwd=c(2,2), bty="n")

# compare the two algorithms
plot(F,predict(xgmod,newdata=as.matrix(xdata)),ylab="xgboost",xlab="manual boost",main="XGBoost vs. MyTreeBoost");abline(0,1)

My manual version of boosted regression trees vs. XGBoost. Marginal plots.

Key Takeaways:

  • boosted regression trees is just a type of boosting (like we've done above) but with a CART learner
  • XGBoost is just very similar type of FGD but with a custom (and computationally efficient, powerful) tree model (not CART).
  • the tree model allows non-linear and complex interactions, making it potentially very powerful

PART 5: Real World Examples: Boosting in R with Package mboost.

The above manual boosting algorithms were just for illustrative purposes of the more universal features of boosting: weak learners, functional gradient descent, shrinkage, sparsity (for l1-regularization and boosted regression trees)

For a real data analysis, one would use the powerful package mboost. Here are a variety of learners available to you to mix and match in mboost.

# least squares learners
bols(x1,by=x2) # interaction

# splines
bbs(x1) # 
bbs(x1,by=x2) # bivariate spline

# tree learners
btree(x1,x2) # 

# radial basis function

# random effect

# spatial smooth

# markov random field

Example: Component-wise boosting with linear base-learners (e.g., like a Lasso solution)

# LASSO-like regularization
m.lasso <-  mboost(formula = y ~ bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu),
                  data = xdata,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number
                      nu = 0.03) # learning rate

Here is how we can specify possible interactions (with sex).

m.interaction <-  mboost(formula = y ~ bols(age,by=sex)+bols(sex)+bols(bmi,by=sex)+bols(map,by=sex)+bols(tc,by=sex)+bols(ldl,by=sex)+bols(hdl,by=sex)+bols(tch,by=sex)+bols(ltg,by=sex)+bols(glu,by=sex),
                  data = xdata,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number
                      nu = 0.03) # learning rate

Example: Generalized Additive Model (GAM) with splines.

xdata2 <-, xdata)) # and a column vector of 1's for the intercept

m.gam <- mboost(formula = y ~ bbs(age,df=4)+bbs(bmi,df=4)+bbs(map,df=4)+bbs(tc,df=4)+bbs(ldl,df=4)+bbs(hdl,df=4)+bbs(tch,df=4)+bbs(ltg,df=4)+bbs(glu,df=4)+bols(sex,df=4,intercept=FALSE)+bols(interc,intercept=FALSE),
                  data = xdata2,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 150, # number
                      nu = 0.02) # learning rate

Here is how to do a boosted regression trees with a special tree model called conditional inference trees (e.g., somewhat like XGBoost).

# boosted regression trees
m.tree <- mboost(formula = y ~ btree(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu) ,
                  data = xdata2,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number
                      nu = 0.02) # learning rate


PART 6: "Hypothesis Testing" with Boosting.

Everything we've discussed so far has pertained to estimation and prediction. Aside from the Lasso's sparse selection of features, we haven't made inferences about which covariates are "truly" important or significant.

Hypothesis testing is a massive topic. Every reader will come to this blog with there own idea (misconceptions) about what constitutes hypothesis-testing. Here are four different frameworks that are (crudely) about finding a sparse set of "important" covariates.

  • Significance Testing: from R.A. Fisher: uses the p-value to find evidence against a favoured hypothesis.
  • Hypothesis Testing: from Jerzy Neyman and Egon Pearson: capping the long-run rate of Type-I errors at some low threshold (falsely rejecting a null-hypothesis), while trying to minimize the long-run rate of Type-II errors (falsely rejecting the alternative hypothesis).
  • Consistent selection: ability to estimate the coefficients as if the "true model" were known in advanced (e.g., Adaptive Lasso, TwinBoosting), aka Oracle Estimators.
  • Bayesian Model Selection: select the model (and covariates) with highest posterior probability (e.g., with Bayes Factors).

Most people have heard of Fisher's p-value or NP's Hypothesis-testing. Machine Learning practitioners are generally obsessed with sparsity and the consistency property in high-dimensional inference. This can also be thought of as a type of l0 regularization: we are trying to shrink all-unimportant covariates' coefficients to zero, and we want the important covariates to have unbiased estimates (unlike l1 or l2 regularization). Such an estimator is called consistent.

Vanilla boosting has no ability to do hypothesis-inference (no p-values, no guarantees of capping the Type-I errors, no guarantee aboiut consistency, no Bayes Factors). In fact, a component-wise boosting algorithm tuned via cross-validation will tend to have higher false-positives that consistent estimators

However, there are many clever tweaks to the algorithm to get some version of hypothesis-based inference. Two popular choices are:

Let's try Twinboosting on the Diabetes dataset. Twinboosting uses two rounds of boosting. The first round is component-wise boosting, and provides initial estimates of the coefficients in round 2. In round 2, the initial estimates act as a scaling to increasing the selection of the high-weight covariates, and decreasing the selection of low-importance covariates. Amazingly, this leads to a consistent estimator. This type of inference is the following:

  • twinboosting is consistent with large n and a sparse feature set (most covariates have no effect are shrunk to 0);
  • when TwinBoosting selects covariates X, we keep X as our best current candidate for the truth (i.e, a Bayesian interpretation).

Description of TreeBoosting from original Paper

We can also compare the coefficient weights of the l1-regularization (component-wise boosting / Lasso) vs. the coefficients of the l0-regularization (TwinBoosting), as in the Figure below.

# Twinboosting

# twinboosting requires two rounds of boosting (much like the adaptive lasso)
# Round One: Component-wise boosting
cv.round1 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=FALSE, mstop=700),n.cores = 5)
# find the best m
best_m1 <- which.min(cv.round1$cv.error)
# fit the first model
mod1 <- bst(x=xdata, y=y, ctrl = bst_control(mstop=best_m1), family = "gaussian", learner = "ls")

# 2nd round of boosting:
# initialize the coefficients with the values form from round 1
cv.round2 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=700),n.cores = 5)

# get the optimal hyperparameter
best_m2 <- which.min(cv.round2$cv.error) <- bst(x=xdata, y=y, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=best_m2))

# plot the results

# The model only selects THREE covariates: BMI, LTG, MAP, sex, and HDL
age   0.00000
sex -13.28290
bmi 589.21010
map 188.90932
tc    0.00000
ldl   0.00000
hdl -87.27599
tch   0.00000
ltg 521.75831
glu   0.00000

# let's compare the coefficient estimates from Lasso solution vs. TwinBoosting
coefs <- matrix(c(coef(mod1),coef(, K,2,dimnames = list(names(xdata),c("l1","l0")))
o <- order(abs(coefs[,1]),decreasing=TRUE)
coefs <- coefs[o,]

barplot(t(coefs),beside=TRUE,legend.text=c(expression(hat(beta)*"-\u2113"*1~"(Lasso)"),expression(hat(beta)*"-\u2113"*0~"(TwinBoost)")),main ="TwinBoost vs. Lasso Coefficients\n(Diabetes Data)")

TwinBoosting coefficients vs. Component-wise Boosting coefficients. (Sparse vs. shrunk).

Take Home Message:

  • TwinBoosting yields a sparser set of coefficients than l1-regularization (more coefficients are exactly 0)
  • The less-important covariates are smaller compared to l1-coefficients
  • The more-important covariates are larger compared to the l1-coefficients (remember, we know the l1-coefficients are slightly biased low).

For more information about other ways of doing "Hypothesis Testing" with boosting, see the high-dimensional p-values 10.1198/jasa.2009.tm08647, or high-dimensional error-control of Type-I errors, or posterior inclusion probabilities based on stability selection.


  • Boosting is a statistical modelling technique which uses functional gradient descent to minimize a cost function while sequentially building an ensemble of shrunken weak learners
  • Boosting is a generalization of l2-regularization, l1-regularization, and can be tweaked (like in  TwinBoosting) to get a consistent sparse estimator.

Thursday, 4 October 2018

NLP and LSTM for Classifying Customer Complaints about Financial Products

Some of this work is available as python code at

Topic Embeddings

One of the most interesting concepts to arise out of machine learning in the last decade has been the concept of "Embeddings" (see, for example, "word2vec" developed by Tomas Mikolov at Google). The general idea is to transform categories from discrete, independent objects and "embed" them into a low-dimensional vector space. Each object/category is no longer it's own discrete entity, but is a vector in some space. In the word2vec example, each English word is a vector of size 300, such that variations along each of the 300 dimensions has inferred semantic importance (e.g,. "girl" vs. "boy" vary along a dimension, just as as "queen" vs. "king"). Another example is the famous instacart model, where the analysts sought to embed 10 million grocery-items into 10 dimensions. For my Insight Data project, I sought an embedding for customer-feedback topics from a propriety customer survey dataset (e.g., to organize the various things customers complain about).

From the perspective of a data-analyst, the immediate utility of the embedding approach is as an alternative means of vectorizing categorical variables, and, in particular, finding a vectorization that is low-dimensional. Consider the traditional route of vectorizing categorical data via "One-hot-coding": each category becomes a binary dummy variable in a matrix of size K-1. In contrast, the embedding approach uses deep-learning to vectorize each category into continuous latent vectors, whereby the embedding dimension D is typically much lower than the number of categories D less than K. In my case, I was dealing with about 200 consumer-feedback categories (and growing exponentially!), while my embedding space just had 6 dimensions (see Fig 1) below.
Fig1: Example embedding: various consumer-complaint categories vectorized into 6 dimensions.

Why all the fuss? In each discipline, there is a flurry creative uses for the embedding technology. For my use-case, I was interested in embeddings for two applications:
  • finding redundant categories; and
  • grouping categories into meaningful super-groups (clustering).
For the first goal, it is possible to find redundant/related categories according to the following intuition: vectors that are close in the embedding space are strongly related, or even redundant. We can cluster these putative different categories into super-categories through clustering algorithms, like K-means.

I will walk through a particular problem which invites embedding solution, and another which perhaps does not benefit from embeddings.

Example 1: Multi-class Classification

The following example comes from a propriety dataset, so the context and results are all general, and the code cannot be shared. Nonethless, the results are interesting for two reasons: the use of embeddings, and the pivotal role of the multi-class loss.


While studying ML & data science with Insight Data 2018, I played consultant for a local startup company in Toronto. They had a large data-set of customer feedback in text form. Each row of text was a customer complaint or recommendation regarding a large variety of products, companies and shopping experiences. They employed an army of humans to go through each text feedback and extract customer sentiments (-,0,+). The problem was that these categories/labels were increasing exponentially in time: the number of labels was growing from 50 to 200 (to perhaps 1000?) in a matter of months. How could I scale up this classification & sentiment analysis?

Fig 2: Left: Exponential increase in the number of labels assigned to the customer-feedback text. Right: extreme skew to the distribution of labels in the data: most are rare (perhaps redundant?) a few are well represented.

The scaling issue became very apparent after trying a generic 'go-to' model with a simple natural language processing (NLP) component and deep-learning model. Using python, tensorsflow, and the keras API, the 'go-to' model had the following pipeline: pre-process the text (stemming words, remove stopwords, etc.); vectorize the words of the text with a word-embedding (like word2vec, but trained within the context of the this problem); run the word-vectors through a recurrent neural network (e.g., LSTM); finally, use Multi-class Classification (and conditional sentiment analysis) to assign category-labels to each customer-review.


The go-to model performed well for the most abundant 20 categories, but it didn't scale well for entire dataset of >200 categories. My worry was that the exponential increase in the number of categories meant that there was considerable redundancy in the ever-increasing quantity of (putative) categories.

Solution: Embedding

Inspired by Instacart and the lessons at, I thought it might be helpful to try and reduce the number of "effective" categories through embeddings. Basically, the algorithm learns an embedding space, different regions of the space have "meaning", and each category is merely a that point exists in this space. Related Categories like "free samples" and "sign-up incentives" are close together in this space, while categories like "website layout" and "shipping costs" occupy a different part of the space.

The neural architecture is shown in figure 3. And the keras API looks like:

# X_train: the vectorized text data (training set)
# Y_train: is a 3D tensor of targets of conditional targets: 
# ... : the rows represent observation
# ... : the cols represent categories
# ... : the slices represent the sentiments per category [NA,-,0,+]
# X_topics_train : just a vector of integers 1:N_labels
# W_train : sample-weights for unbalanced design

embed_dim_lstm = 194 # word-embedding dimensions
lstm_OutputDim = 100 # output dimensions of the LSTM
embed_dim_topic = 6 # category embedding dimensions
batch_size = 128 # mini-batch size
hidden_nodes_final = (np.linspace(lstm_OutputDim,Ymultclass.shape[1],3).round().astype(int))[1] # dimensions of the final hidden layer

lstm_input_layer = keras.layers.Input(shape=(X_train.shape[1],), dtype='int32', name='lstm_input',) # lstm input
lstm_embed_layer = keras.layers.Embedding(input_dim=max_features, output_dim=embed_dim_lstm, input_length=X_train.shape[1])(lstm_input_layer) # input_dim = vocab size,
lstm_output = keras.layers.LSTM(lstm_OutputDim)(lstm_embed_layer) # the output has dimension (None, 12)
# need to Repeat the LSTM output from a 2D matrix to a 3D tensor in order to concatenate to the category-embedding
lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)
topic_input_layer = keras.layers.Input(shape=(X_topics_train.shape[1],), dtype='int32', name='topic_input') # input the topics
topic_embed_layer = keras.layers.Embedding(input_dim=nTopics, output_dim = embed_dim_topic, input_length=X_topics_train.shape[1])(topic_input_layer) # topic embedding
# need to reshape
x = keras.layers.concatenate([lstm_reshape,topic_embed_layer],axis=2) # is this axis 1 or 2??
hidden_layer = keras.layers.Dense(hidden_nodes_final, activation='relu')(x)
#x = keras.layers.BatchNormalization()(x)
main_output = keras.layers.Dense(4, activation='softmax', name='main_output')(hidden_layer) # main output for categories
model = keras.models.Model(inputs=[lstm_input_layer,topic_input_layer], outputs=[main_output])
model.compile(loss = "categorical_crossentropy", optimizer='adam',metrics = ['accuracy'])
print(model.summary()) # I guess I can use both{'lstm_input': X_train, 'topic_input': X_topics_train }, # inputs
          {'main_output': Y_train}, # targets
          epochs = 50, # 
          batch_size=batch_size, verbose = 2,
          validation_data=([X_val, X_topics_val], Y_val), # validation data
          sample_weight = W_train) # sample-weights for unbalanced design

After compiling the model, the keras API returns the following list of layers and their shape.

Layer (type)                    Output Shape         Param #     Connected to                     
lstm_input (InputLayer)         (None, 198)          0                                            
embedding_1 (Embedding)         (None, 198, 194)     582000      lstm_input[0][0]                 
lstm_1 (LSTM)                   (None, 100)          118000      embedding_1[0][0]                
topic_input (InputLayer)        (None, 145)          0                                            
repeat_vector_1 (RepeatVector)  (None, 145, 100)     0           lstm_1[0][0]                     
embedding_2 (Embedding)         (None, 145, 6)       870         topic_input[0][0]                
concatenate_1 (Concatenate)     (None, 145, 106)     0           repeat_vector_1[0][0]            
dense_1 (Dense)                 (None, 145, 122)     13054       concatenate_1[0][0]              
main_output (Dense)             (None, 145, 4)       492         dense_1[0][0]                    
Total params: 714,416
Trainable params: 714,416
Non-trainable params: 0

Figure 3: Diagram of the above Keras API model. The top arm is a generic text-classification model (word-tokens -> word embedding -> LSTM), while the bottom arm includes the "category embeddings". The outputs of the LSTM and the category-embeddings are concatenated before running through a final Dense layer. The final targets are multi-class labels (x-axis) and their conditional sentiments (NA,-,0,+) along the z-axis

According to the out-of-sample AUC statistics, the model performed quite well (>0.95). What is especially interesting is the learned relationships among the different categories, as visualized in figure 4 (below). The fact that related categories cluster together in space demonstrates that the model does indeed learning meaningful embeddings of the categories.

Fig 4: Consumer feedback categories and their learned position in a low-dimensional embedding space (here, visualized in 2D with t-SNE. 
There are a couple interesting things to note about the neural architecture:

  1. the 'generic' LSTM model is still present in this embedding model: it merely represents the chain of layers: `lstm_input_layer`, `lstm_embed_layer`, `lstm_output` (representing the word-embedding and the LSTM layers).
  2. the category-embedding goes through the chain: `topic_input_layer`,  `topic_embed_layer`
  3. in order to combine outputs of the LSTM chain with the outputs from the category-embedding chain, I had to repeat the outputs of the LSTM so that it had the same dimensions as the category-embedding in the X and Y axes (i.e., concatenating along the Z axis): `lstm_reshape = keras.layers.RepeatVector(nTopics)(lstm_output)`

Loss function:  Importance of the Multi-class loss

Notice also, that both the final layer (`hidden_layer`) and the Y targets are 3D dimensional. This has a very important implication for the `categorical_crossentropy` loss function: the unit-vector which sums to 1 is taken along the z-axis; in the model specification, this z-axis represents the 4 different sentiment categories per a particular category: [NA = not-present; negative; neutral; positive]. The columns are just different possible categories, each with their own sentiment-unit-vector heading off in the Z-axis direction. This is summarized in the figure 5. This has two important points:

  • First, the entire analysis is "multi-class" classification, such that each piece of text (a row in the matrix) can have multiple labels (which makes sense, because people can and do talk about different things in their customer feedback.
  • Second, the rows (different texts) and columns (different categories) are essentially treated as independent observations, according to the loss function!

Figure 5: structure of data-targets and the loss function. In Keras, if the targets is 3D (here, text on y-axis, categories on x-axis, and conditional sentiments on z-axis), and you choose a "categorical_crossentropy" loss function, then the distribution is assumed multinomial over the Z-axis. This means, the probability of each element in the z-axis sums to one (sentiments) while the columns and rows are all independent.

Therefore, we see why the embeddings may help the predictive performance of the model, and why the model is able to learn meaningful embeddings of the categories: were the model not to learn the associations among the categories (via the embeddings), then each category would contribute an independent amount to the loss function, irrespective of all the other categories, per sentence. With the embeddings, the model is learning a structure that helps it predict which columns/categories are related.

The specification of this loss function (such that each category is independent in the eyes of the loss function) is the crux of the embeddings usefulness. In the next section, I will do another seemingly similar analysis according to a different loss function, with very different results.

Example 2: Multinomial Classification

Due to the secretative nature of company I was working with, I wanted to show an alternative analysis using publicly available data for text-classification.  I settled on the USA Consumer Financial Protection Bureau's financial complaint database