Thursday 25 October 2018

Boosting Deep Dive: from MLEs to Hypothesis Testing

I wanted to write a small tutorial outlining some of the key concepts of "boosting". Boosting is a broad topic related to many other methods (e.g., Random Forest, SVM, bagging, the Lasso), and it comes in many flavours (AdaBoost, component-wise boosting, boosted regression trees, TwinBoosting). In this tutorial, we'll walk through the basics of boosting and visualize some of its key statistical properties. Eventually, we'll build up to the powerful XGBoost (winner of so many Kaggle competitions). We'll show how this powerful learner comes from a simple algorithm that encompasses not only XGBoost, but also l2- and l1-regularization.

Part of the confusing nature of boosting is that its core concepts arose in different disciplines. The majority come from researchers in machine learning and learning theory. Later, statisticians united the myriad methods under a single statistical framework. For more on this history, I recommend reading Buhlmann and Hothorn or Mayr and Ratch for the modern synthesis across methods. Boosting is a very general and comprehensive technique.

Let's start off with a definition of boosting: "Boosting is a statistical modelling technique which uses functional gradient descent to minimize a cost function while sequentially building an ensemble of shrunken weak learners".

Key Concepts

I want the audience to understand these four key concepts:

  • how Functional Gradient Descent works
  • what are Weak Learners and why are they crucial for boosting
  • what is an Ensemble of weak learners
  • what is shrinkage

These concepts separate boosting from similar methods. Later in the blog, I'll discuss other important concepts like l1-regularization, l2-regularization, and hypothesis testing / l0-regularization.

Key Notation

  • X: a matrix of covariates/independent variables
  • Y: the response variable (here, normally distributed)
  • F: the "fit-vector"; our predicted value of Y (on the working scale)
  • g: a learner that tries to predict Y from X: g(X) -> Y
  • r: the negative gradient
  • nu: the learning rate (less than 1)

Weak Learners

Weak learners produce a classifier that is only slightly better than random guessing (> 50% accuracy). The foundations of boosting arise from the insight that weak learners can be iteratively improved ("boosted") to produce an overall strong learner. Schapire and Freund developed the first boosting algorithms, such as AdaBoost.

Example with trees:
A weak learner may be a CART model with a low maximum tree-depth (figure below, right). It is biased, but has low variance (i.e., it doesn't vary much with a new sample of data). A strong CART learner would have a higher maximum tree-depth (figure below, left), giving it more flexibility and yielding less-biased estimates. But its estimates would vary more with a new sample of data.

CART Learner: strong vs. weak: low-bias high-variance (left), high-bias low-variance (right)

Example with linear-learners (least squares learners, splines, etc).
In its normal operation, a least-squares learner provides the maximum-likelihood solution to a linear regression between covariates X and a continuous response Y (figure below, top). Least-squares learners can also be weak or strong. For example, a weak version of a least-squares learner is penalized least-squares (aka ridge regression). A strong ridge penalty decreases the learner's flexibility, complexity, and degrees-of-freedom: it is more constrained and must discriminate among its potential covariates. Like the weak CART learner, it is biased but has low variance. In contrast, an unpenalized least-squares learner will just do vanilla linear regression: certainly a powerful solution that has been the backbone of data analysis for over 100 years.

Least Squares Learner: strong (unpenalized) vs weak (penalized)
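
To make "weak" concrete, here is a minimal sketch (my own illustration; the edf function and toy design matrix are invented for this example) that measures a ridge learner's effective degrees-of-freedom as the trace of its hat matrix. Increasing the penalty shrinks the effective degrees-of-freedom, making the learner weaker:

# Sketch: effective degrees-of-freedom of a (ridge-penalized) least-squares learner,
# measured as the trace of the hat matrix H = X (X'X + lambda*I)^-1 X'
edf <- function(X, lambda = 0){
    X <- as.matrix(X)
    H <- X %*% solve(crossprod(X) + diag(lambda, ncol(X)), t(X))
    sum(diag(H)) # the trace = effective degrees-of-freedom
}
X <- matrix(rnorm(100 * 5), 100, 5) # a toy design matrix
edf(X, lambda = 0)   # ~5: the strong (unpenalized) learner uses all its flexibility
edf(X, lambda = 100) # much smaller: a weak (heavily penalized) learner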
Why does weakness matter?
The "weak learner" is a key feature of boosting. Such weak learners separate boosting from the random forest algorithm. In random forest, one creates an ensemble of strong-learners: each is a complex, high-variance tree-learner with a deep maximum tree-depth. In contrast, in boosted regression trees, one uses trees with relatively low max-depth. Other types of learners are possible, like splines, radial basis functions, least-squares: so long as we have some way to heavily penalize complexity and keep them weak.

Why the emphasis on weak learners? Weak learners must make the best of a bad situation: they must "choose" one or a few of the features, within the bounds of their constraints, whereas strong learners can throw everything at a problem.

More formally, the weakness of learners is one of the ways in which boosting negotiates the famous "bias-variance trade-off": at low boosting iterations, the model is simple and biased; at greater boosting iterations, the model is more complex, less biased, and higher variance. Tuning the number of boosting iterations prevents overfitting. If we use cross-validation to "tune" the number of boosting iterations, then, as Buhlmann and Yu showed, boosting has an "adaptive complexity" at the level of the entire ensemble: it can be either low complexity or high complexity, governed by the number of boosting iterations. This can be seen in the animation below, where the fitted spline has a maximum of 4 degrees-of-freedom (like a 4th-order polynomial), while the data are clearly 5th or 6th order (it is a sine curve). Nonetheless, by combining twenty 4th-order splines through boosting, the final model can clearly approximate the higher-order data. Weak learners are boosted to yield a strong, complex ensemble-learner.
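
Here is a minimal sketch of that idea (assuming the mboost package, introduced properly in Part 5; the simulated data and settings are my own, not the animation's):

# Sketch: boosting twenty weak (df = 4) splines to approximate a sine curve
library(mboost)
set.seed(1)
x <- seq(0, 2*pi, length.out = 200)
y_sin <- sin(2*x) + rnorm(200, sd = 0.3)
dat <- data.frame(x = x, y_sin = y_sin)
m <- mboost(y_sin ~ bbs(x, df = 4), data = dat,
            control = boost_control(mstop = 20, nu = 0.5))
plot(x, y_sin)
lines(x, fitted(m), col = "red", lwd = 2) # the ensemble approximates the sine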


Functional Gradient Descent

You've most likely heard of "gradient descent". The figure below shows how a parameter (beta) can be iteratively updated by incrementing its value according to the gradient of the loss function. Note that it is the parameter that is updated. Let's contrast this with Functional Gradient Descent (FGD).

Gradient Descent vs. Functional Gradient Descent


In FGD, we literally fit a learner (called g) to the negative gradient. We iteratively update the entire "fit-vector" F (our best estimate of Y). As we descend the gradient of the loss function, we generate many fitted learners, each of which has been fit at a different point along the descent. The full algorithm is shown above and in the code below.

Why are we iteratively fitting the negative gradient? An intuitive explanation comes from the early boosting algorithm AdaBoost. In that classification algorithm, one sequentially re-weights the data so that each new iteration focuses on the cases that were difficult to classify in earlier iterations. In other words, the algorithm focuses on the cases with very high residuals (lots of unexplained variance).

Functional Gradient Descent: (top) the current residuals at boosting iteration m; the purple learner is fit to the residuals, and the orange line is the shrunk version, shrunk by the learning rate; (bottom) the ensemble learner is just the sum of all the previously fit learners.


With a Gaussian loss function, the negative gradient is just the residuals: y - F. Thus, by fitting the negative gradient, we force the boosting algorithm to first dispatch the easy cases (small residuals) and then sequentially focus on the tougher cases (bigger residuals). This notion of "adaptive residuals" generalizes to other non-Gaussian loss functions via their negative gradients.


# Gaussian Loss
(y - f)^2
# Gaussian negative-gradient
y - f

# Binomial loss:
p <- exp(f)/(exp(f) + exp(-f))
-y * log(p) - (1 - y) * log(1 - p)

# Binomial negative gradient (mboost codes y as -1/+1 and scales by log(2)):
exp2yf <- exp(-2 * y * f)
-(-2 * y * exp2yf)/(log(2) * (1 + exp2yf))

# Negative Binomial Loss
-(lgamma(y + exp(f)) - lgamma(exp(f)) - lgamma(y + 1) + exp(f) *  
     f + y * log(mu) - (y + exp(f)) * log(mu + exp(f))) 

# Negative Binomial negative-gradients (one per parameter, as in mboost's NBinomial)
# mu:
y - (y + sigma)/(exp(f) + sigma) * exp(f)
# sigma:
-exp(f) * (digamma(y + exp(f)) - digamma(exp(f)) +
           log(exp(f)) + 1 - log(mu + exp(f)) -
           (exp(f) + y)/(mu + exp(f)))

# see ?Family from package mboost for more
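
If you ever define your own loss, it is worth checking the negative gradient numerically. Here is a small sketch (my own; loss_g, ngrad_g, and the test values are invented). Note that with the 1/2-scaled Gaussian loss (y - f)^2/2 the negative gradient is exactly y - f; the factor of 2 from the unscaled loss is simply absorbed into the learning rate.

# Sketch: sanity-check the Gaussian negative gradient by finite differences
loss_g  <- function(y, f) (y - f)^2 / 2 # 1/2-scaled squared-error loss
ngrad_g <- function(y, f) y - f         # claimed negative gradient
y0 <- 3; f0 <- 1; eps <- 1e-6
-(loss_g(y0, f0 + eps) - loss_g(y0, f0 - eps)) / (2 * eps) # numeric: ~2
ngrad_g(y0, f0)                                            # analytic:  2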


Ensemble Learning

If we ran the boosting algorithm for 20 iterations, then we would have 20 fitted learners. This forms an ensemble of learners. The final model is the sum of all the learners. To make predictions on new data, each learner makes a prediction, and those predictions are summed (element-wise). Each learner's contribution is shrunk by a learning rate. The learning rate is small, so that the output from each weak learner is small.

The way we aggregate the predictions of the ensemble of learners differs from bagging methods (like Random Forest), which aggregate their learners in some other way (like majority voting or averaging).
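
As a sketch of what this aggregation means (assuming learners stored as a list with a $predict(newdata) method, the structure used in Parts 3 and 4 below):

# Sketch: predicting from a boosted ensemble by summing shrunken predictions
predict_ensemble <- function(ensemble, newdata, y_mean, nu){
    f <- rep(y_mean, nrow(newdata)) # start from the offset (the mean of y)
    for(g in ensemble){
        f <- f + nu * g$predict(newdata) # element-wise sum of shrunken predictions
    }
    f
}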

Shrinkage

A key part of boosting is that we "stop short" the algorithm. The number of iterations is determined by, e.g., cross-validation or the AIC. For least-squares learners, if we ran the algorithm forever, we would eventually recover the Maximum Likelihood Estimates of the parameters. For any smaller number of iterations, the estimated parameters are shrunk towards zero. Copas (1983) discovered the importance of shrinkage for prediction.

The shrinkage phenomenon is strange. It is related to the James-Stein effect for estimating the means of a multivariate Gaussian. Copas noticed a similar effect for GLMs. Boosted models are a type of shrinkage estimator.

The shrinkage manifests as a slight bias-towards-zero (away from the MLE). We now understand that this is a way of negotiating the classic bias-variance trade-off. For more on the bias-variance trade-off and shrinkage, see my blog post called "A primer on the Bias-Variance Trade-off".

Many non-statistician researchers are shocked to see such obvious bias. But it is strategic: we incur some bias so as to reduce the variance, and thereby reduce the overall Expected Risk (the expected mean square error). Having a lower Expected Risk means we are more likely to be close to the truth, even if we are systematically biased. If you don't grok this idea of the bias-variance trade-off, then Copas (1997) offers an intuitive explanation: "One of the classic examples of regression to the mean is the inheritance of height. Tall fathers tend to have shorter sons, and short fathers tend to have taller sons. Here y1 (father's height) and y2 (son's adult height) are two random variables with (almost exactly) the same distribution, but the conditional expectation E[y2|y1] is not y1 but a value closer to the overall mean."
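
Here is a toy simulation (my own illustration, not from Copas; the 0.8 shrinkage factor is arbitrary) showing that a crude shrink-towards-zero estimator can beat the MLE on expected squared error:

# Sketch: shrinkage can lower the expected risk
set.seed(42)
theta <- rnorm(10) # true means
mse <- function(est) mean((est - theta)^2)
reps <- replicate(5000, {
    y_obs <- theta + rnorm(10) # one noisy observation per mean
    c(MLE = mse(y_obs), shrunk = mse(0.8 * y_obs)) # shrink all estimates by 20%
})
rowMeans(reps) # here, the shrunken estimator has the lower expected risk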

PART 1: Example of Functional Gradient Descent I: Shrinkage Estimator

I will now show FGD in action by manually performing FGD with just one least-squares learner (e.g., a multiple-regression learner). The dataset is the famous diabetes data (442 observations, 10 covariates) from Efron et al.'s "Least Angle Regression" paper. Our learner will use all 10 covariates: age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu.

Let's first load the data and calculate the MLEs:


library(MASS) # get the lm.ridge function
library(lars) # get the LASSO/LAR functions
data(diabetes) # data set to analyze

# response variable: y
y <- diabetes$y

# explanatory covariates
x <- scale(diabetes$x) # standardize the covariates
xdata <- setNames(as.data.frame(matrix(as.numeric(diabetes$x),nrow(diabetes$x),ncol(diabetes$x))),nm=colnames(x))

# number of observations
n <- nrow(xdata)

# number of covariates
K <- ncol(xdata)

# show the maximum likelihood estimates
coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata))))

MLEs of the diabetes data:


> coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata))))
(Intercept)         age         sex         bmi         map          tc         ldl         hdl         tch         ltg         glu 
  152.13348   -10.01220  -239.81909   519.83979   324.39043  -792.18416   476.74584   101.04457   177.06418   751.27932    67.62539 



Now let's define our learner. It does a very simple thing: it returns a function ("fit") that calculates the least-squares solution for a given Y.


# Learner: Least Squares (optionally ridge-penalized)
LeastSquaresLearner <- function(X, penalty = 10^-7){
    # X: input the covariates (X); 
    # penalty: possible ridge penalty
    
    # add intercept to the data: X
    X <- as.matrix(cbind(intercept = rep(1,nrow(X)),X))
    
    # (optional) add ridge penalty as a diagonal matrix
    Omega = diag(ncol(X))*penalty

    # X'X  + Ridge Penalty
    XtX <- crossprod(X, X)+Omega

    # Fitting function: get least-squares solution
    fit <- function(y){

        # get betas: inv(Xt*X+Omega) * Xt * y 
        beta_hat = solve(XtX, crossprod(X, y)) # inv(Xt*X+Omega)*Xt * y 
        beta_hat = as.numeric(beta_hat)

        # prediction function
        predict <- function() X%*%beta_hat 

        # return fitted learner
        return(list(
                beta = beta_hat, # LS solution
                predict = predict # provides the fit vector
            )
        )
    }
    
    # outer return 
    return(list(fit = fit))
}

# try the learner
l1 = LeastSquaresLearner(xdata)
l1_fit = l1$fit(y)
f = l1_fit$predict()

Finally, we'll perform Functional Gradient Descent. At each iteration we: update the residuals/negative-gradient (r) by taking the difference between the response variable (Y) and the fit-vector (F); fit the learner (g) to the negative gradient; and increase our fit-vector by the output of the learner, shrunk by the learning-rate (nu).

Key features:


  • the learner is NOT weak; it is an unpenalized multiple-regression learner.
  • there is only ONE learner: it performs multiple regression on all covariates.
  • this is for illustrative purposes only. No one would do this in practice (there are much better estimators).
  • boosting hyperparameters: the number of boosting steps is mstop=1200 and the learning-rate is nu=0.05.


# PART 1 : SHRINKAGE (Copas 1983)

# let's compare to functional gradient descent!
mstop = 1200 # number of iterations
nu= 0.05 # learning rate 
beta_FGD <- matrix(0,mstop, ncol(xdata))

# initialize the learner
ls_learner = LeastSquaresLearner(xdata)

# initialize the model: model of the mean
y_mean <- mean(y)
F <- rep(y_mean,n)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# boosting
for(m in 1:mstop){

    # fit the learner to the residual
    g <- ls_learner$fit(r)

    # update the Model
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y, F)

    # store: empirical risk
    vRisk[m] <- risk(y,F)
    
    # store the coefficients (minus the intercept)
    beta_FGD[m,] <- nu*g$beta[-1] 

}

# Betas: per boosting step
beta_boost <- apply(beta_FGD, 2,cumsum)
# Beta: maximum likelihood estimation
beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata))))

# plot the regularization pathways
plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways")

# plot each coefficient
for(k in 1:ncol(beta_FGD)){
    lines(x=log(1:mstop), y=beta_boost[,k], col=k)
}    

# add MLEs
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)


Regularization Pathways: as we move to the right (more steps of the gradient descent), the regression coefficients grow until they reach the Maximum Likelihood Estimates (far right).

Key Takeaways (Shrinkage Estimator)


  • FGD eventually produces the Maximum Likelihood Estimates as the number of boosting iterations gets very large (see the check below).
  • at smaller boosting iterations, the betas are shrunk towards zero, but none are exactly zero.
  • this is most similar to the "pre-shrunk" estimators from Copas 1983 (no one uses these anymore).
  • there is no feature selection at lower m: all betas are shrunk towards zero from their MLEs (i.e., early stopping doesn't favour one covariate over another). The strong learner makes use of every single feature at every iteration.
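
A quick check of the first takeaway (a one-liner using the objects from the Part 1 code above):

# the final boosted coefficients should be (numerically) the MLEs
round(beta_boost[mstop, ] - beta_MLE[-1], 3) # ~0: boosting has reached the MLEs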

PART 2: Example of Functional Gradient Descent II: Ridge Regression (l2-regularization)


In the next example, we will use the same learner as above, but penalized. We will apply a strong ridge penalty that reduces the learner's degrees-of-freedom, i.e., the learner will have to start "choosing" among features within the bounds of its constraints.


Key Features:


  • there is only ONE learner: the least-squares learner performs penalized multiple regression.
  • the penalized learner is weak compared to the one in the previous section.
  • boosting hyperparameters: mstop=7000; learning-rate nu=0.5; a learner penalty of 7.



# PART 2: L2 REGULARIZATION

mstop = 7000 # number of iterations
nu= 0.5 # learning rate 
beta_FGD <- matrix(0,mstop, ncol(xdata))

# initialize the learner
ls_learner = LeastSquaresLearner(xdata,penalty=7)

# initialize the Fit-vector
y_mean <- mean(y)
F <- rep(y_mean,n)
# goal: get F close to Y (minimizes loss)

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# boosting
for(m in 1:mstop){

    # fit the learner to the residual
    g <- ls_learner$fit(r)

    # update the Fit Vector
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y, F)

    # estimate the loss
    vRisk[m] <- risk(y,F)
    
    # store the coefficients (minus the intercept)
    beta_FGD[m,] <- nu*g$beta[-1] 

}

# Betas per boosting step
beta_boost <- apply(beta_FGD, 2,cumsum)
# Beta: maximum likelihood estimation
beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata))))

# plot the regularization pathway
par(mfrow=c(2,1))
plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways")

# plot each coefficient
for(k in 1:ncol(beta_FGD)){

    lines(x=log(1:mstop), y=beta_boost[,k], col=k)
    
}

# add MLEs
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

# Compare to ridge regression
penalty_seq <- exp(seq(log(10000),log(0.2),length.out = 50))  # from 10000 down to 0.2

# fit ridge regr for each value of the penalty_seq
lridge <-  lm.ridge(y ~ age + sex + bmi + map + tc + ldl + hdl + tch + ltg + glu, data = data.frame(cbind(y=y, xdata)),lambda = penalty_seq)

# all ridge solutions
beta_ridge <- coef(lridge)[,-1] # -1 ignores the intercept

# make a plot
plot(rev(range(log(penalty_seq))), range(beta_ridge),type="n", xlab = "ridge penalty", ylab = expression(hat(beta)), xaxt = "n", main = "Ridge: Regularization Pathways")
axis(1,at=rev(log(penalty_seq)), labels=sprintf("%0.5s",penalty_seq),las=2)
# plot each coefficient
for(k in 1:ncol(beta_ridge)){

    lines(x=rev(log(penalty_seq)), y=beta_ridge[,k], col=k)
    
}

# add MLEs
points(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4)
text(max(log(penalty_seq)), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

For comparison, I have shown the "regularization pathways" of a Ridge Regression solution. From left to right, we are weakening the regularization of the Ridge Regression (lambda hyperparameter), until eventually there is no penalty and the solutions are the MLEs.

Notice any similarity between the regularization pathways of the boosted penalized least-squares learner and the ridge-regression?

Regularization Pathways for l2-regularizer: left is stronger regularization, right is weaker regularization

Key Takeaways.


  • boosting can approximate the solutions of l2-regularization (ridge regression).
  • a penalized/weak learner is forced to discriminate among features, which begets some feature learning. You can see this in the above regularization pathways, in which the betas have different relative magnitudes at different steps (e.g., "tc" is disfavoured until about exp(6)≈403 boosting iterations, after which its beta value explodes in magnitude).
  • at smaller boosting iterations, the betas are shrunk towards zero, but none are exactly zero.
  • FGD obtains the MLEs as the number of boosting iterations goes to infinity.

NOTE: the optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.
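
For completeness, here is a minimal sketch of how that tuning looks in mboost (a hypothetical usage; glmboost is mboost's linear-model interface, and dat_all is a data.frame I build from the y and xdata objects of Part 1):

# Sketch: choose mstop by 10-fold cross-validation
library(mboost)
dat_all <- data.frame(cbind(y = y, xdata)) # y plus the 10 covariates
m <- glmboost(y ~ ., data = dat_all,
              control = boost_control(mstop = 1000, nu = 0.1))
cvm <- cvrisk(m, folds = cv(model.weights(m), type = "kfold"))
mstop(cvm)    # the CV-optimal number of boosting iterations
m[mstop(cvm)] # truncate the model at the optimal iteration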


PART 3: Example of Functional Gradient Descent III: Component-wise Boosting


In the next example, we will use a different learner per covariate. In Examples I and II, there was one least-squares learner, which performed multiple regression using all the covariates. Our next algorithm, which uses a different learner per covariate, is called component-wise boosting. The key is that at each boosting iteration, only one learner/covariate is picked to enter the ensemble.

Something magical happens with component-wise boosting :)

Key Features:


  • there is one learner per covariate.
  • each learner competes to fit the gradient and add its fit to the overall fit-vector F.
  • boosting hyperparameters: the number of boosting steps is mstop=5000 and the learning-rate is nu=0.2; the learners are unpenalized.


mstop = 5000 # number of iterations
nu= 0.2 # learning rate

# store the coefficients per boosting step
beta_FGD <- matrix(0,mstop, ncol(xdata))

# store the learners
ensemble <- vector(mode="list",length = mstop)

# initialize the learners
# notice there are 10 learners: one learner per covariate
ls_learners = setNames(vector(mode = "list", length = K),nm=colnames(xdata))
for(k in 1:K){
    ls_learners[[k]] <- LeastSquaresLearner(xdata[,k,drop=FALSE])
}
    
# initialize the fit-vector (F): the current estimate of y
y_mean <- mean(y)
F <- rep(y_mean,n) # Fit-vector
# goal: get F close to Y

# risk function: sum of square error
loss <- function(y,F) (y-F)^2
risk <- function(y,F) sum(loss(y,F))
vRisk <- numeric(mstop)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# dummy list: temporary storage of learners
g.list <- setNames(vector(mode = "list", length = K),nm=colnames(xdata))

# boosting
for(m in 1:mstop){

    # fit EACH learner to the residual
    g_SS <- numeric(K)
    for(k in 1:length(ls_learners)){
        # fit each learner
        g.list[[k]] <- ls_learners[[k]]$fit(r)

        # calculate GOF statistic per learner
        g_SS[k] <- sum((g.list[[k]]$predict()-r)^2)
    }

    # get best learner
    best_learner <- which.min(g_SS)
    g <- g.list[[best_learner]]

    # update the Fit Vector 
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y,F)

    # store the empirical risk
    vRisk[m] <- risk(y,F)
    
    # store the coefficients (minus the intercept)
    beta_FGD[m,best_learner] <- nu*g$beta[-1] 

    # add learner to the ensemble (optional)
    ensemble[[m]] <- g
    
}

# add up all the coefficients per learner
beta_boost <- apply(beta_FGD, 2,cumsum)

par(mfrow=c(2,1),bty="n")
plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)), ,main = "Functional Gradient Descent:\n Regularization Pathways")
# plot each coefficient
for(k in 1:ncol(beta_FGD)){

    lines(x=log(1:mstop), y=beta_boost[,k], col=k)
    
}
# add MLEs
points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

# LASSO: estimate the lasso betas
library(glmnet)
lasso <- glmnet(x=as.matrix(xdata),y = y, alpha=1,lambda=exp(seq(log(45),log(0.025),length.out=1000)))
beta_lasso <- t(lasso$beta)
vLambda <- lasso$lambda

# plot the lasso betas 
rangebeta = range(beta_lasso)
nsteps = nrow(beta_lasso)
plot(c(1,nsteps), rangebeta,type="n", xlab = "LASSO regularization steps",ylab = expression(hat(beta)))
for(k in 1:ncol(beta_lasso)){
    lines(1:nsteps, beta_lasso[,k], col = k)
}
# add MLEs
points(rep(nsteps,K), beta_MLE[-1], col = 1:K, pch=19)
text(rep(nsteps,K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA)
text(nsteps, max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)


For comparison, I have shown the "regularization pathways" of Lasso Regression. From left to right, we are weakening the regularization on the Lasso (lambda hyperparameter), until we get to the far-right solution where we basically obtain the MLEs. Towards the left (stronger regularization), notice that only a few of the covariates have every been picked and have non-zero coefficients. In other words, at low iterations, the Lasso provides a sparse solution: some of the less-important covariates have coefficients that are exactly zero. This is a type of feature selection: the algorithm is learning to pick and chose among covariates, shrinking some to absolute zero.

Notice any similarity between the regularization pathways of component-wise boosting and the Lasso? In general, they are similar but not identical. We say that they are approximately the same, and that component-wise boosting provides a type of l1-regularization (Hastie, Tibshirani, and Friedman 2001).

l1-Regularization: left is stronger regularization, right is weaker regularization

Key Takeaways.


  • boosting can approximate the solutions of l1-regularization (Lasso regression).
  • component-wise boosting and l1-regularization provide automatic feature selection.
  • FGD obtains the MLEs as the number of boosting iterations gets large.
  • at smaller boosting iterations, some betas are exactly zero (i.e., a sparse solution; see the check below).

NOTE: the optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.
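
A quick check of that sparsity (using the objects from the Part 3 code above; the early iteration m_small = 50 is an arbitrary choice of mine):

# Sketch: early-stopped component-wise boosting is sparse
m_small <- 50 # an arbitrary early stopping point
round(setNames(beta_boost[m_small, ], names(xdata)), 1) # several betas are typically still exactly 0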

The benefit of using component-wise boosting over a LASSO model is that we can use very complex combinations of learners. For example, in the R package mboost, we might like to have interactions between sex and the other covariates.

Why all the excitement for l1-regularization?

A key feature of Lasso-like solutions is a property called the minimax-optimal rate of convergence in high-dimensional regression. In English, this means that our worst-case theoretical error shrinks as fast as possible with growing n. Among frequentists, minimax-optimality is just one of many desirable properties of an estimator: it provides a theoretical guarantee that in the worst case (high error), we'll still beat all the other losers.

Furthermore, l1-regularization often also has the lowest prediction error (e.g., Expected Error). Let's demonstrate this through cross-validation, comparing the Shrinkage model, the l2-regularization model, the component-wise boosting model, and of course the Maximum Likelihood Estimates (MLE).

We will run all these models using the R package mboost. The least-squares learners are exactly the same as in the previous examples. See ?bols for an in-depth description of mboost's formula and learner specification.


# Cross-Validation: compare four estimators in terms of prediction error
# Models: MLEs, Shrinkage, Ridge, and Component-wise boosting

library(mboost) # boosting library
library(parallel) # for mclapply, used by cvrisk below

# join Y and X into a single dataframe
alldata <- data.frame(cbind(y=y, xdata)) 

# make the cross-validation weights
cv10f <- cv(weights = rep(1,n), type = "kfold")

# re-run each model in mboost: Shrinkage, l2/Ridge, l1/Lasso, MLE
# boosting with one unpenalized learner (not technically boosting; just a Copas shrinkage model)
m.shrinkage <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu), data = alldata,control=boost_control(mstop=10000,nu=0.2))
# boosting with one penalized learner (l2-regularization) 
m.l2 <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu,lambda=7), data = alldata,control=boost_control(mstop=10000,nu=0.2))
# component-wise boosting 
m.l1 <- mboost(y~bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu), data = alldata,control=boost_control(mstop=10000,nu=0.2))
m.MLE = glm(y~.,data=alldata)

# do 10-fold cross-validation on each model
cvfit.shrink <- cvrisk(m.shrinkage, folds = cv10f, papply = mclapply,mc.cores=5) # Copas model (not boosting)
cvfit.l2 <- cvrisk(m.l2, folds = cv10f, papply = mclapply,mc.cores=5) # l2 (ridge)
cvfit.l1 <- cvrisk(m.l1, folds = cv10f, papply = mclapply,mc.cores=5) # l1 (component-wise)
cvfit.MLE <- lapply(1:10,function(k){
    # insample weights
    cv.wts<-cv10f[,k]
    # run cv glm
    mle<-glm(y~.,data=alldata,weights=cv.wts)
    # loss function (y-f)^2
    sum(((y-mle$fitted)*!cv.wts)^2)/sum(!cv.wts)
})        

# Expected Error (at the optimal boosting step per model)
cvError = c(
    "MLE" = mean(unlist(cvfit.MLE)),
    "Shrinkage Model (Copas)" = min(colMeans(cvfit.shrink)),
    "l2 (Ridge)" =  min(colMeans(cvfit.l2)),
    "l1 (Component-wise)" =  min(colMeans(cvfit.l1))
)

# plot the CV-errors
par(mar=c(11,5,2,0),mgp=c(4,1,0))
barplot(cvError,ylim=range(pretty(cvError)),las=2,ylab="Expected Error (10-fold CV)", main = "Component-wise boosting wins, with lowest predictive error\n (\u21131-regularization) ",xpd=FALSE,col="pink")


Comparison of prediction error from cross-validation, according to four different types of estimators. Lower prediction error is better.

Why this order of predictive performance? From worst to best:

  1. the Maximum Likelihood model has no shrinkage and no feature learning
  2. the Shrinkage model has shrinkage, but no feature learning
  3. the l2 model has shrinkage and some weak feature learning
  4. the l1 model (component-wise boosting) has shrinkage and does automatic feature learning, successfully setting some unimportant covariates to zero

PART 4: Example of Functional Gradient Descent IV: Boosted Regression Trees.


In the next example, our learner will be a single (weak) decision tree from the package party (a conditional inference tree, a close cousin of CART). Despite this wildly different learner, the FGD algorithm is the same: we sequentially fit the learner to the negative gradient (the residuals). The only thing that has changed is the learner: a tree model with a small maximum depth instead of penalized least-squares.

We will compare this boosted regression tree model to the FAMOUS and POWERFUL XGBoost (the go-to solution for many Kaggle competitions, and rightly so). We start by defining our learner:


library(party)
TreeLearner = function(X, control = ctree_control(maxdepth=2,mtry=round(0.7 * ncol(X)),mincriterion=0), subsample = 1){

    mf <- as.data.frame(X)
    n <- nrow(mf)
    if(subsample <1){
        get_weights <- function() tabulate(sample(1:n,size=round(n*subsample)),n)
    } else {
        get_weights <- function() rep(1,n) 
    }
    # fit a PARTY model (conditional inference tree)
    fit = function(y){
        # fit a tree model
        btreemodel <-ctree(y ~ ., data = data.frame(cbind(y=y,mf)), weights = get_weights(), controls=control)
        # return the predictions
        predict <- function(newdata=NULL) btreemodel@predict_response(newdata)
        return(list(
            beta=btreemodel,
            predict = predict
        ))
    }
    return(list(fit = fit))
}

Key Features:


  • one tree learner, with hyperparameters: maxdepth = 2, minbucket = 7 (the party default), and a row-subsampling rate of 0.66
  • for each boosting iteration, the learner gets a random subsample of rows of data (like bagging!)
  • for each boosting iteration, the learner gets a random subsample of columns of features (like random forest!)
  • boosting hyperparameters: the number of boosting steps is 200 and the learning-rate is 0.02.



# initialize the fit-vector
y_mean <- mean(y)
F <- rep(y_mean,n)

# initialize the gradient
negative_gradient <- function(y,F){ y-F } 
r <- negative_gradient(y,F)

# initialize the learner
learners <- TreeLearner(xdata,subsample=0.66)

# initialize the ensemble
mstop <- 200
nu <- 0.02
ensemble <- vector(mode = "list", length = mstop)

# tree boosting
for(m in 1:mstop){

    g <- learners$fit(r)
    
    # update the Fit-vector
    F <- F + nu * g$predict()

    # recalculate the gradient
    r <- negative_gradient(y,F)

    # add the learner to the ensemble
    ensemble[[m]] <- g

}


# fit an XGBoost model
library(xgboost)
xgmod <- xgboost(data = as.matrix(xdata),
                 label = y,
                 max_depth = 2, # maximum tree depth
                 eta = nu, # shrinkage rate
                 nthread = 2,
                 nrounds = mstop,
                 objective = "reg:linear",
                 subsample=0.66,
                 colsample_bytree = 0.7,
                 min_child_weight = 7 # weights needed in each child to allows splitting
                 )

# make marginal plots: compare XGBoost and our manual boosting
marginal_data <- as.data.frame(apply(xdata,2,function(x){rep(median(x),50)}))

par(mfrow=c(4,3))
for(k in names(xdata)){

    # set all other convariates to their mean (0) (other than k)
    marginal_tmp <- marginal_data
    
    # predict over the range of the target covariate
    marginal_tmp[,k] <- seq(min(xdata[,k]),max(xdata[,k]),length.out = 50)

    # get my predictions    
    f <- rep(y_mean,50)
    for(g in 1:length(ensemble)){
        # update f
        f <- f + nu*ensemble[[g]]$predict(newdata=marginal_tmp)
    }

    # get predictions from XGboost
    f.xgboost<-predict(xgmod,newdata=as.matrix(marginal_tmp))
    
    # plot my boosting function
    plot(marginal_tmp[,k], f, type="l",main = k, col = "red",ylim = range(f,f.xgboost), lwd=2)
    
    # add line for xgboost (blue)
    lines(marginal_tmp[,k], f.xgboost, col ="blue", lwd=2)

    # legend
    legend(x="bottomright",legend=c("xgboost","manual"), col=c("blue","red"),lwd=c(2,2), bty="n")
}

# compare the two algorithms
plot(F,predict(xgmod,newdata=as.matrix(xdata)),ylab="xgboost",xlab="manual boost",main="XGBoost vs. MyTreeBoost");abline(0,1)

My manual version of boosted regression trees vs. XGBoost. Marginal plots.

Key Takeaways:


  • boosted regression trees are just a type of boosting (like we've done above) but with a tree learner
  • XGBoost is a very similar type of FGD but with a custom (computationally efficient, powerful) tree model (not CART); the check below quantifies the agreement
  • the tree model allows non-linear effects and complex interactions, making it potentially very powerful
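
As a final check (one line, using objects from the Part 4 code above), the in-sample predictions of the two implementations should agree closely:

# Sketch: agreement between manual tree boosting and XGBoost (in-sample)
cor(F, predict(xgmod, newdata = as.matrix(xdata))) # should be close to 1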


PART 5: Real World Examples: Boosting in R with Package mboost.


The manual boosting algorithms above were just illustrations of the universal features of boosting: weak learners, functional gradient descent, shrinkage, and sparsity (for l1-regularization and boosted regression trees).

For a real data analysis, one would use the powerful package mboost. Here is a variety of learners available to mix and match in mboost.


# least-squares learners
bols(x1)
bols(x1, by = x2) # interaction

# splines
bbs(x1) # penalized B-spline (P-spline)
bbs(x1, by = x2) # varying-coefficient spline (interaction)

# tree learner
btree(x1, x2) # conditional inference tree

# radial basis functions
brad(x1, x2)

# random effect
brandom(x1)

# spatial smooth
bspatial(x1, x2)

# markov random field
bmrf(x1)

Example: Component-wise boosting with linear base-learners (e.g., like a Lasso solution)


# LASSO-like regularization
library(mboost)
m.lasso <-  mboost(formula = y ~ bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu),
                  data = alldata, # data.frame containing y and the covariates
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number of boosting iterations
                      nu = 0.03) # learning rate
                   )
coef(m.lasso)
summary(m.lasso)
plot(m.lasso)

Here is how we can specify possible interactions (with sex).


library(mboost)
m.interaction <-  mboost(formula = y ~ bols(age,by=sex)+bols(sex)+bols(bmi,by=sex)+bols(map,by=sex)+bols(tc,by=sex)+bols(ldl,by=sex)+bols(hdl,by=sex)+bols(tch,by=sex)+bols(ltg,by=sex)+bols(glu,by=sex),
                  data = alldata, # data.frame containing y and the covariates
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number of boosting iterations
                      nu = 0.03) # learning rate
                   )
coef(m.interaction)
summary(m.interaction)
plot(m.interaction)

Example: Generalized Additive Model (GAM) with splines.

# SPLINES: GAM
xdata2 <- as.data.frame(cbind(interc=1, xdata)) # add a column of 1's for the intercept

m.gam <- mboost(formula = y ~ bbs(age,df=4)+bbs(bmi,df=4)+bbs(map,df=4)+bbs(tc,df=4)+bbs(ldl,df=4)+bbs(hdl,df=4)+bbs(tch,df=4)+bbs(ltg,df=4)+bbs(glu,df=4)+bols(sex,df=4,intercept=FALSE)+bols(interc,intercept=FALSE),
                  data = xdata2,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 150, # number of boosting iterations
                      nu = 0.02) # learning rate
                )
par(mfrow=c(3,2))
plot(m.gam)


Here is how to do boosted regression trees with a special tree model called conditional inference trees (somewhat like XGBoost).


# boosted regression trees
m.tree <- mboost(formula = y ~ btree(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu) ,
                  data = xdata2,
                  family = Gaussian(), # controls the gradient and loss function
                  control = boost_control(
                      mstop = 200, # number of boosting iterations
                      nu = 0.02) # learning rate
                )

par(mfrow=c(4,3))
plot(m.tree)


PART 6: "Hypothesis Testing" with Boosting.


Everything we've discussed so far has pertained to estimation and prediction. Aside from the Lasso's sparse selection of features, we haven't made inferences about which covariates are "truly" important or significant.

Hypothesis testing is a massive topic. Every reader will come to this blog with their own ideas (and misconceptions) about what constitutes hypothesis-testing. Here are four different frameworks that are (crudely) about finding a sparse set of "important" covariates.

  • Significance Testing: from R.A. Fisher: uses the p-value to quantify evidence against a null hypothesis.
  • Hypothesis Testing: from Jerzy Neyman and Egon Pearson: capping the long-run rate of Type-I errors (falsely rejecting a null hypothesis) at some low threshold, while trying to minimize the long-run rate of Type-II errors (failing to reject a false null hypothesis).
  • Consistent selection: the ability to estimate the coefficients as if the "true model" were known in advance (e.g., Adaptive Lasso, TwinBoosting), aka Oracle Estimators.
  • Bayesian Model Selection: select the model (and covariates) with highest posterior probability (e.g., with Bayes Factors).


Most people have heard of Fisher's p-value or Neyman-Pearson hypothesis-testing. Machine-learning practitioners are generally obsessed with sparsity and the consistency property in high-dimensional inference. This can also be thought of as a type of l0-regularization: we are trying to shrink all the unimportant covariates' coefficients to zero, and we want the important covariates to have unbiased estimates (unlike l1- or l2-regularization). Such an estimator is called consistent.

Vanilla boosting has no machinery for hypothesis-based inference (no p-values, no guarantees of capping the Type-I error rate, no guarantee about consistency, no Bayes Factors). In fact, a component-wise boosting algorithm tuned via cross-validation will tend to have more false positives than consistent estimators.

However, there are many clever tweaks to the algorithm to get some version of hypothesis-based inference. Two popular choices are:

  • TwinBoosting (demonstrated below); and
  • stability selection (which yields posterior inclusion probabilities; see the references at the end of this section).

Let's try TwinBoosting on the diabetes dataset. TwinBoosting uses two rounds of boosting. The first round is ordinary component-wise boosting, which provides initial estimates of the coefficients for round 2. In round 2, the initial estimates act as weights, increasing the selection of high-importance covariates and decreasing the selection of low-importance covariates. Amazingly, this leads to a consistent estimator. The resulting type of inference is the following:

  • TwinBoosting is consistent for large n and a sparse feature set (most covariates have no effect and are shrunk to 0);
  • when TwinBoosting selects covariates X, we keep X as our best current candidate for the truth (i.e., a Bayesian-flavoured interpretation).

Description of TwinBoosting from the original paper


We can also compare the coefficient weights from l1-regularization (component-wise boosting / Lasso) vs. the coefficients from l0-regularization (TwinBoosting), as in the figure below.



# Twinboosting
library(bst)

# twinboosting requires two rounds of boosting (much like the adaptive lasso)
# Round One: Component-wise boosting
cv.round1 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=FALSE, mstop=700),n.cores = 5)
plot(cv.round1$cv.error)
# find the best m
best_m1 <- which.min(cv.round1$cv.error)
# fit the first model
mod1 <- bst(x=xdata, y=y, ctrl = bst_control(mstop=best_m1), family = "gaussian", learner = "ls")

# 2nd round of boosting:
# initialize the coefficients with the values from round 1
cv.round2 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=700),n.cores = 5)
plot(cv.round2$cv.error)

# get the optimal hyperparameter
best_m2 <- which.min(cv.round2$cv.error)
mod.ls <- bst(x=xdata, y=y, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=best_m2))

# plot the results
plot(mod.ls)
coef(mod.ls)

# The model only selects FIVE covariates: sex, bmi, map, hdl, and ltg
matrix(coef(mod.ls),dimnames=list(names(xdata),1))
            1
age   0.00000
sex -13.28290
bmi 589.21010
map 188.90932
tc    0.00000
ldl   0.00000
hdl -87.27599
tch   0.00000
ltg 521.75831
glu   0.00000

# let's compare the coefficient estimates: Lasso-like (round 1) vs. TwinBoosting
coefs <- matrix(c(coef(mod1),coef(mod.ls)), K,2,dimnames = list(names(xdata),c("l1","l0")))
o <- order(abs(coefs[,1]),decreasing=TRUE)
coefs <- coefs[o,]

jpeg("img/twinboosting_vs_lasso.jpg",width=600,height=600)
barplot(t(coefs),beside=TRUE,legend.text=c(expression(hat(beta)*"-\u2113"*1~"(Lasso)"),expression(hat(beta)*"-\u2113"*0~"(TwinBoost)")),main ="TwinBoost vs. Lasso Coefficients\n(Diabetes Data)")
dev.off()

TwinBoosting coefficients vs. Component-wise Boosting coefficients. (Sparse vs. shrunk).

Take Home Message:


  • TwinBoosting yields a sparser set of coefficients than l1-regularization (more coefficients are exactly 0)
  • the less-important covariates have smaller coefficients than their l1 counterparts
  • the more-important covariates have larger coefficients than their l1 counterparts (remember, the l1 coefficients are biased slightly towards zero)


For more information about other ways of doing "hypothesis testing" with boosting, see high-dimensional p-values (DOI: 10.1198/jasa.2009.tm08647), high-dimensional control of Type-I errors, or posterior inclusion probabilities based on stability selection.

Summary


  • Boosting is a statistical modelling technique which uses functional gradient descent to minimize a cost function while sequentially building an ensemble of shrunken weak learners
  • Boosting is a generalization of l2-regularization and l1-regularization, and can be tweaked (as in TwinBoosting) to yield a consistent, sparse estimator.
