**XGBoost**(winner of so many kaggle competitions). We'll show how this powerful learner comes from a simple algorithm which encompasses not only XGBoost, but also

*l*2- and

*l*1-regularzation.

Part of the confusing nature of boosting is that its core concepts have arisen from different disciplines. The majority come from workers in machine learning and learning theory. Later, statisticians united the myriad methods under a statistical framework. For more on this history, I recommend reading Buhlmann and Hothorn or Mayr and Ratch for the modern synthesis across methods. Boosting is very general and comprehensive.

Let's start off with a definition of boosting: "Boosting is a statistical modelling technique which uses

*functional gradient descent*to minimize a cost function while sequentially building an ensemble of

*shrunken weak learners*".

## Key Concepts

I want the audience to understand these four key concepts:- how
**Functional Gradient Descent**works - what are
**Weak Learners**and why are they crucial for boosting - what is an
**Ensemble**of weak learners - what is
**shrinkage**

These concepts separate boosting from similar methods. Later in the blog, I'll discuss other important concepts like l1-regularization, l2-regularization, and hypothesis testing / l0-regularization.

### Key Notation

###
- X: A matrix of covariate/independent effects
- Y: A response variable (here, normally distributed)
- F: "fit-vector" this is our predicted value of y (on the working scale)
- g(y,x): a learner; tries to predict Y from X: g(X)->Y
- r: negative-gradient
- nu: learning rate, less than 1

### Weak Learners

Weak Learners produce a classifier that is slightly better than random guessing (> 50%). The foundations of boosting arise from the insight that weak learners can be iteratively improved (boosted) to produce an overall strong learner. Schapire and Freund developed the first boosting algorithms, such as Adaboost.**Example with trees:**

A weak learner may be a CART-model with a low maximum tree-depth (figure below,

*right*). It is biased, but low variance (i.e, it doesn't vary much with a new sample of data). A strong CART learner would have a higher maximum tree depth (figure below,

*left*), given it more flexibility and yielding less-biased estimates. But, its estimates would vary more with a new sample of data.

CART Learner: strong vs. weak: low-bias high-variance (left), high-bias low-variance (right) |

**Example with linear-learners (least squares learners, splines, etc).**

In it's normal operations, a least-squares learner provides the maximum-likelihood solution to a linear regression between covariates X and continuous response Y (figure below,

*top*). Least-Squares learners can also be weak or strong. For example, a weak version of a least-squares learner would be

*penalized*least-squares (aka, ridge-regression). A strong ridge-penalty decreases the learner's flexibility, complexity, and degrees-of-freedom: it is more constrained and must discriminate among its potential covariates. Like the week CART learner, it is also biased and low variance. In contrast, an unpenalized least-squares learner will just do vanilla linear-regression: certainly a powerful solution that has been the backbone of data analysis for over 100 years.

Least Squares Learner: strong (unpenalized) vs weak (penalized) |

**Why does weakness matter?**

The "weak learner" is a key feature of boosting. Such weak learners separate boosting from the random forest algorithm. In random forest, one creates an ensemble of strong-learners: each is a complex, high-variance tree-learner with a deep maximum tree-depth. In contrast, in boosted regression trees, one uses trees with relatively low max-depth. Other types of learners are possible, like splines, radial basis functions, least-squares: so long as we have some way to heavily

*penalize*complexity and keep them weak.

Why the emphasis on weak learners?

__Weak Learners must make the best of a bad situation:__they must "choose" one or a few of the features, within the bounds of their contraints, whereas strong learners can through everything at a problem.

More formally, the weakness of learners is one of the ways in which boosting is able to negotiate the famous "bias-variance trade-off": at low boosting iterations, the model is simple and biased; at greater boosting iterations, the model is more complex, less-biased, and higher variance. Negotiating the number of boosting iterations prevents overfitting. If we use cross-validation to "tune" the number of boosting iterations, then Buhlmann and Yu showed how boosting has an "adaptive complexity" at the level of the entire ensemble: it can be either low complexity or high complexity, governed by the number of boosting iterations. This can be seen in the animation below, where the fitted spline has maximum degrees-of-freedom 4 (like 4th order polynomial), while the data is clearly 5th or 6th order (it is a sin curve). Nonetheless, by combining twenty 4th-order splines through boosting, the final model clearly can approximate the higher-order data. Weak learners are boosted to yield a strong, complex ensemble-learner.

### Functional Gradient Descent

You've most likely heard of "gradient descent". Figure XXX shows how a parameter (beta) can be iteratively updated by incrementing its value according to the gradient of the loss function. Note that it is the*parameter*that is updated. Let's contrast this with FGD.

Gradient Descent vs. Functional Gradient Descent |

In FGD, we literally

**fit a learner to the negative gradient**(called g). We are iteratively updating the entire "fit-vector" (our best estimate of Y). This is an iterative process, such that as we descend the gradient of the loss function, we generate many fitted-learners, each of which has been fitted at different slopes along the gradient descent. The full algorithm is shown above and in the code below.

Why are we iteratively fitting the negative gradient? An intuitive explanation comes from the early boosting algorithm Adaboost. In that classification algorithm, one sequentially fits the parts of the data that are difficult to classify in earlier iterations. In other words, the algorithm focuses on those cases with very high residuals (lots of unexplained variance).

With a Gaussian loss function, the negative gradient is just the residuals: y-f. Thus, by fitting the negative gradient, we are likewise forcing our boosting to first do-away with the easy cases (small residuals), and then sequentially focus on the tougher cases (bigger residuals). This notion of "adaptive residuals" is generalized for other non-Gaussian loss functions by using the negative gradients.

# Gaussian Loss (y - f)^2 # Gaussian negative-gradient y - f #Binomial Loss: p <- exp(f)/(exp(f) + exp(-f)) -y * log(p) - (1 - y) * log(1 - p) #Binomial negative gradient: exp2yf <- exp(-2 * y * f) -(-2 * y * exp2yf)/(log(2) * (1 + exp2yf)) # Negative Binomial Loss -(lgamma(y + exp(f)) - lgamma(exp(f)) - lgamma(y + 1) + exp(f) * f + y * log(mu) - (y + exp(f)) * log(mu + exp(f))) # Negative Binomial negative-gradient mu: y - (y + sigma)/(exp(f) + sigma) * exp(f) sigma:- exp(f) * (digamma(y + exp(f)) - digamma(exp(f)) + log(exp(f)) + 1 - log(mu + exp(f)) - (exp(f) + y)/(mu + exp(f))) # see ?Family from package mboost for more

### Ensemble Learning

If we ran the boosting algorithm for 20 iterations, then we would have 20 fitted learners. This forms an ensemble of leaners. The final model is the aggregate of sum all the learners. In order to make predictions on new data, each learner would have to make a prediction, and those predictions are summed (element-wise). The aggregation is shrunk by a*learning rate*. The learning rate is small, so that the outputs from each weak learner is small.

The way we aggregate the predictions of the ensemble of learners is different from Bagging methods (like Random Forest) which aggregate the learners in some other way (like majority voting).

### Shrinkage

A key part of boosting is that we "stop short" the algorithm. The number of iterations is determined by, e.g., cross-validation or the AIC. For least-squares learners, if we ran the algorihtm until infinity, we would eventually get the Maximum Likelihood Estimate of the parameters. For some smaller number of iterations, the estimated parameters will be*shrunk towards zero*. Copas 1983 discovered the importance of shrinkage for prediction.

The shrinkage phenomenon is strange. It is related to the James-Stein effect for estimating the means of a multivariate Gaussian. Copas then noticed a similar effect for GLM models. Boosted models are a type of shrinkage estimator.

The shrinkage manifests as a slight bias-towards-zero (away from the MLE). We now understand that this is a way of negotiating the classic Bias-Variance Trade-off. For more on the Bias-Variance trade-off and shrinkage, see my blog post at called "A primer on the Bias-Variance Trade-off".

Many non-statistician researchers are shocked to see such obvious bias. But it is strategic: we are incurring some bias so to reduce the variance and reduce the overall Expected Risk (the Expected Mean Square Error). Having a lower Expected Risk means that we are

**more likely to be closer to the truth**even if we are systematically biased. If you don't grok this idea of the bias-variance trade-off, then Copas (1997) offers an intuitive explanation: "One of the classic examples of regression to the mean is the inheritance of height. Tall fathers tend to have shorter sons, and short fathers tend to have taller sons. Here y1 (father's height) and y2 (son's adult height) are two random variables with (almost exactly) the same distribution, but the conditional expectation E[y2|y1] is not y1 but a value closer to the overall mean."

## PART 1: Example of Functional Gradient Descent I: Shrinkage Estimator

I will now show FGD in action by manually performing FGD with just one Least Squares learner (e.g,. a multiple regression learner). The dataset is the famous diabetes data (442 observations, 10 covariates) Efron et al "Least Angle Regression" paper. Our learner will use all 10 covariates: age,sex,bmi,map,tc,ldl,hdl,tch,ltg,gluLet's first load the data and calculate the MLEs

library(MASS) # get the lm.ridge function library(lars) # get the LASSO/LAR functions data(diabetes) # data set to analyze # response variable: y y <- diabetes$y # explanatory covariates x <- scale(diabetes$x) # standardize the covariates xdata <- setNames(as.data.frame(matrix(as.numeric(diabetes$x),nrow(diabetes$x),ncol(diabetes$x))),nm=colnames(x)) # nummber of observations n <- nrow(xdata) # number of covariates K <- ncol(xdata) # show the maximum likelihood estimates coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata))))

MLEs of the diabetes data:

> coef(lm(formula = y~.,data=data.frame(cbind(y=y,xdata)))) (Intercept) age sex bmi map tc ldl hdl tch ltg glu 152.13348 -10.01220 -239.81909 519.83979 324.39043 -792.18416 476.74584 101.04457 177.06418 751.27932 67.62539

Now let's define our learner: it does very simple thing, just returns a function ("fit") that calculates the least-squares solution for a given Y.

# Learning: Least Squares: LeastSquaresLearner <- function(X, penalty = 10^-7){ # X: input the covariates (X); # penalty: possible ridge penalty # add intercept to the data: X X <- as.matrix(cbind(intercept = rep(1,nrow(X)),X)) # (optional) add ridge penalty as a diagonal matrix Omega = diag(ncol(X))*penalty # X'X + Ridge Penalty XtX <- crossprod(X, X)+Omega # Fitting function: get least-squares solution fit <- function(y){ # get betas: inv(Xt*X+Omega) * Xt * y beta_hat = solve(XtX, crossprod(X, y)) # inv(Xt*X+Omega)*Xt * y beta_hat = as.numeric(beta_hat) # prediction function predict <- function() X%*%beta_hat # return fitted learner return(list( beta = beta_hat, # LS solution predict = predict # provides the fit vector ) ) } # outer return return(list(fit = fit)) } # try the learner l1 = LeastSquaresLearner(xdata) l1_fit = l1$fit(y) f = l1_fit$predict()

Finally, we'll perform Functional Gradient Descent. A each iteration we: update the residuals/negative-gradient (r) by taking the difference between the fit-vector (F) and the response variable (Y); then fit the learner to the negative-gradient (g), increase our fit-vector by the outputs from the learner, shrunk by the learning-rate (nu).

#### Key features:

- the learner is NOT weak, it is an
*un*penalized multi-regressor. - there is only ONE learner: the learner performs multiple-regression.
- this is for illustrative purposes only. No one would do this in reality (there are much better estimators)
- boosting hyperparameters: the number of boosting steps is and the learning-rate.

# PART 1 : SHRINKAGE (Copas 1983) # let's compare to functional gradient descent! mstop = 1200 # number of iterations nu= 0.05 # learning rate beta_FGD <- matrix(0,mstop, ncol(xdata)) # intialize the learner ls_learner = LeastSquaresLearner(xdata) # initialize the model: model of the mean y_mean <- mean(y) F <- rep(y_mean,n) # initialize the gradient negative_gradient <- function(y,F){ y-F } r <- negative_gradient(y,F) # risk function: sum of square error loss <- function(y,F) (y-F)^2 risk <- function(y,F) sum(loss(y,F)) vRisk <- numeric(mstop) # boosting for(m in 1:mstop){ # fit the learner to the residual g <- ls_learner$fit(r) # update the Model F <- F + nu * g$predict() # recalculate the gradient r <- negative_gradient(y, F) # store: empirical risk vRisk[m] <- risk(y,F) # store the coeffcients (minus the intercept) beta_FGD[m,] <- nu*g$beta[-1] } # Betas: per boosting step beta_boost <- apply(beta_FGD, 2,cumsum) # Beta: maximum likelihood estimation beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata)))) # plot the regularization pathways plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways") # plot each coefficient for(k in 1:ncol(beta_FGD)){ lines(x=log(1:mstop), y=beta_boost[,k], col=k) } # add MLEs points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19) text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4) text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

Regularization Pathways: as we go to the left (more steps in the gradient descent) the regression coefficients increase until they reach the Maximum Likelihood Estimates (right). |

#### Key Takeaways (Shrinkage Estimator)

- the FGB eventually produces the Maximum Likelihood Estimates as the number of boosting iterations gets very large.
- at smaller boosting iterations, the betas are shrunk towards zero
**but none are exactly zero**. - this is most similar to the "pre-shrunk" estimators from Copas 1983 (no one uses these anymore).
- there is
*no feature selection*at lower m: all betas are shrunk-towards-zero by a constant from their MLE's (i.e, early stopping doesn't favour one covariate vs. another). The strong learner makes use of every single feature at every iteration.

##
PART 2: Example of Functional Gradient Descent II: Ridge Regression (*l*2-regularization)

In the next example, we will use the same learner from above, but the learner will be

**penalized**. We will apply a strong ridge penalty that reduces the learners degrees of freedom, i.e., the learner will have to start "choosing" among features within the bounds of its constraints.

#### Key Features:

- there is only ONE learner: the least-square learner performs penalized multiple-regression.
- the penalized learner is weak compared to the previous section.
- hyperparameters: mstop=7000; learning-rate nu=0.5; the learner-penalty of 7

# PART 2: L2 REGULARIZATION mstop = 7000 # number of iterations nu= 0.5 # learning rate beta_FGD <- matrix(0,mstop, ncol(xdata)) # intialize the learner ls_learner = LeastSquaresLearner(xdata,penalty=7) # initialize the Fit-vector y_mean <- mean(y) F <- rep(y_mean,n) # goal: get F close to Y (minimizes loss) # risk function: sum of square error loss <- function(y,F) (y-F)^2 risk <- function(y,F) sum(loss(y,F)) vRisk <- numeric(mstop) # initialize the gradient negative_gradient <- function(y,F){ y-F } r <- negative_gradient(y,F) # boosting for(m in 1:mstop){ # fit the learner to the residual g <- ls_learner$fit(r) # update the Fit Vector F <- F + nu * g$predict() # recalculate the gradient r <- negative_gradient(y, F) # estimate the loss vRisk[m] <- risk(y,F) # store the coeffcients (minus the intercept) beta_FGD[m,] <- nu*g$beta[-1] } # Betas per boosting step beta_boost <- apply(beta_FGD, 2,cumsum) # Beta: maximum likelihood estimation beta_MLE <- coef(lm(y~.,data = data.frame(cbind(y=y,xdata)))) # plot the regularization pathway par(mfrow=c(2,1)) plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)),main = "Functional Gradient Descent:\n Regularization Pathways") # plot each coefficient for(k in 1:ncol(beta_FGD)){ lines(x=log(1:mstop), y=beta_boost[,k], col=k) } # add MLEs points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19) text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA) text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA) # Compare to ridge regression penalty_seq <- exp(seq(log(10000),log(0.2),length.out = 50)) # 0.01 to 10 # fit ridge regr for each value of the penalty_seq lridge <- lm.ridge(y ~ age + sex + bmi + map + tc + ldl + hdl + tch + ltg + glu, data = data.frame(cbind(y=y, xdata)),lambda = penalty_seq) # all ridge solutions beta_ridge <- coef(lridge)[,-1] # -1 ignores the intercept # make a plot plot(rev(range(log(penalty_seq))), range(beta_ridge),type="n", xlab = "ridge penalty", ylab = expression(hat(beta)), xaxt = "n", main = "Ridge: Regularization Pathways") axis(1,at=rev(log(penalty_seq)), labels=sprintf("%0.5s",penalty_seq),las=2) # plot each coefficient for(k in 1:ncol(beta_ridge)){ lines(x=rev(log(penalty_seq)), y=beta_ridge[,k], col=k) } # add MLEs points(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, pch=19) text(rep(max(log(penalty_seq)),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4) text(max(log(penalty_seq)), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

For comparison, I have shown the "regularization pathways" of a Ridge Regression solution. From left to right, we are weakening the regularization of the Ridge Regression (lambda hyperparameter), until eventually there is no penalty and the solutions are the MLEs.

Notice any similarity between the regularization pathways of the boosted penalized least-squares learner and the ridge-regression?

Regularization Pathways for l2-regularizer: left is stronger regularization, right is weaker regularization |

#### Key Takeaways.

- boosting can approximate the solutions of l2-regularization (Ridge Regression).
- a penalized/weak learner is forced to discriminate among features, which begets some feature learning. You can see this in the above regularization pathways in which the betas have different relative magnitudes at different steps (e.g, "tc" is
*disfavoured*until about exp(6)=403 boosting iterations, after which its beta value explodes in magnitude). - at smaller boosting iterations, the betas are shrunk towards zero
**but none are exactly zero**. - the FGB obtains the MLEs as the number of boosting iterations goes to infinity

**NOTE**The optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.

## PART 3: Example of Functional Gradient Descent III: Component-wise Boosting

In the next example, we will use

**a different learner per covariate**. In Examples I and II, there was one least-squares learner, which performed multiple regression using all the covariates. Our next algorithm, which uses a different learner per covariate, is called

**component-wise boosting**. The key is that for each boosting iteration,

**only one learner/covariate**is picked to enter the ensemble.

Something magical happens with component-wise boosting :)

#### Key Features:

- there is one learner per covariate.
- each learner competes to fit the gradient and add it's fit to the overall fit-vector F.
- boosting hyperparameters: the number of boosting steps is mstop=5000 and the learning-rate is 0.2; the learners are unpenalized

mstop = 5000 # number of iterations nu= 0.2 # learning rate # store the coefficients per boosting step beta_FGD <- matrix(0,mstop, ncol(xdata)) # store the learners ensemble <- vector(mode="list",length = mstop) # intialize the learners # notice there are 10 learners: one learner per covariate ls_learners = setNames(vector(mode = "list", length = K),nm=colnames(xdata)) for(k in 1:K){ ls_learners[[k]] <- LeastSquaresLearner(xdata[,k,drop=FALSE]) } # initialize the fit-vector (F) current estimate of y y_mean <- mean(y) F <- rep(y_mean,n) # Fit-vector # goal: get F close to Y # risk function: sum of square error loss <- function(y,F) (y-F)^2 risk <- function(y,F) sum(loss(y,F)) vRisk <- numeric(mstop) # initialize the gradient negative_gradient <- function(y,F){ y-F } r <- negative_gradient(y,F) # dummy list: temporary storge of learners g.list <- setNames(vector(mode = "list", length = K),nm=colnames(xdata)) # boosting for(m in 1:mstop){ # fit EACH learner to the residual g_SS <- numeric(K) for(k in 1:length(ls_learners)){ # fit each learner g.list[[k]] <- ls_learners[[k]]$fit(r) # calculate GOF statistic per learner g_SS[k] <- sum((g.list[[k]]$predict()-r)^2) } # get best learner best_learner <- which.min(g_SS) g <- g.list[[best_learner]] # update the Fit Vector F <- F + nu * g$predict() # recalculate the gradient r <- negative_gradient(y,F) # store the empirical risk vRisk[m] <- risk(y,F) # store the coeffcients (minus the intercept) beta_FGD[m,best_learner] <- nu*g$beta[-1] # add learner to the ensemble (optional) ensemble[[m]] <- g } # add up all the coefficients per learner beta_boost <- apply(beta_FGD, 2,cumsum) par(mfrow=c(2,1),bty="n") plot(range(log(1:mstop)), range(beta_boost),type="n", xlab = "log(boosting step)", ylab = expression(hat(beta)), ,main = "Functional Gradient Descent:\n Regularization Pathways") # plot each coefficient for(k in 1:ncol(beta_FGD)){ lines(x=log(1:mstop), y=beta_boost[,k], col=k) } # ADD mles points(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, pch=19) text(rep(log(mstop+1),K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA) text(log(mstop+1), max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA) # LASSO: estimate the lasso betas library(glmnet) lasso <- glmnet(x=as.matrix(xdata),y = y, alpha=1,lambda=exp(seq(log(45),log(0.025),length.out=1000))) beta_lasso <- t(lasso$beta) vLambda <- lasso$lambda # plot the lasso betas rangebeta = range(beta_lasso) nsteps = nrow(beta_lasso) plot(c(1,nsteps), rangebeta,type="n", xlab = "LASSO regularization steps",ylab = expression(hat(beta))) for(k in 1:ncol(beta_lasso)){ lines(1:nsteps, beta_lasso[,k], col = k) } # ADD MLES points(rep(nsteps,K), beta_MLE[-1], col = 1:K, pch=19) text(rep(nsteps,K), beta_MLE[-1], col = 1:K, labels = names(xdata),pos=4,xpd=NA) text(nsteps, max(beta_MLE[-1])*1.1, col = 1:K, labels = "MLE",pos=3,cex=1.2,xpd=NA)

For comparison, I have shown the "regularization pathways" of Lasso Regression. From left to right, we are weakening the regularization on the Lasso (lambda hyperparameter), until we get to the far-right solution where we basically obtain the MLEs. Towards the left (stronger regularization), notice that only a few of the covariates have every been picked and have non-zero coefficients. In other words, at low iterations, the Lasso provides a

**sparse**solution: some of the less-important covariates have coefficients that are exactly zero. This is a type of

**feature selection:**the algorithm is learning to pick and chose among covariates, shrinking some to absolute zero.

Notice any similarity between the regularization pathways of the component-wise boosting and the Lasso? In general, they are similar but not exact. We say that they are approximately the same, and that component-wise boosting provides a type of l1-regularization (Hastie,Tibshirani,Friedman 2001).

l1-Regularization: left is stronger regularization, right is stronger regularization |

#### Key Takeaways.

- boosting can approximate the solutions of
*l*1-regularization (Lasso regression). - component-wise boosting and
*l*1-regularization provide automatic feature selection. - the FGB obtains the MLE as the number of boosting iterations gets large.
- at smaller boosting iterations, some betas are shrunk towards zero (i.e., a sparse solution).

**NOTE**The optimal number of boosting iterations should be decided by cross-validation or bootstrap validation. The above was just for educational purposes.

The benefit of using component-wise boosting over a LASSO model is that we can have very complex combinations of learners. For example, in the R package mboost, we might like to have some interactions between sex and the other covariates.

####
Why all the excitement for *l*1-regularization?

A key feature of Lasso-like solutions is a property called **minimax optimal rate of convergence**in high-dimensional regression. In English, this means that our worst-case theoretical error is minimized as fast as possible with growing n. Among frequentists, minimax-optimality is just one of many desirable properties of an estimator: it provides a theoretical guarantee that in the worst-case (high error), we'll still beat all the other losers.

Furthermore,

*l*1-regularization also often has the lowest Prediction Errors (e.g., Expected Errors). Let's demonstrate this through cross-validation, comparing the Shrinkage model, the l2-regularization model, component-wise boosting model, and of course Maximum Likelihood Estimates (MLE)

We will run all these models using the R package mboost. The least-squares learners are exactly the same as in the previous examples. See ?bols for an in-depth description of how mboost's formula and learner specification.

# Cross-Validation: compare four estimators in terms of prediction error # Models: MLEs, Shrinkage, Ridge, and Component-wise boosting library(mboost) # boosting library # join Y and X into a single dataframe alldata <- data.frame(cbind(y=y, xdata)) # make the cross-validation weights cv10f <- cv(weights = rep(1,n), type = "kfold") # re-run each model in mboost: Shrinkage, l2/Ridge, l2/Lasso, MLE # boosting with one unpenalized learner (not-boosting technically, just a Copas shrinkage model) m.shrinkage <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu), data = alldata,control=boost_control(mstop=10000,nu=0.2)) # boosting with one penalized learner (l2-regularization) m.l2 <- mboost(y~bols(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu,lambda=7), data = alldata,control=boost_control(mstop=10000,nu=0.2)) # component-wise boosting m.l1 <- mboost(y~bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu), data = alldata,control=boost_control(mstop=10000,nu=0.2)) m.MLE = glm(y~.,data=alldata) # do 10-fold cross-validation on each model cvfit.shrink <- cvrisk(m.shrinkage, folds = cv10f, papply = mclapply,mc.cores=5) # copas model (not-boosting) cvfit.12 <- cvrisk(m.l2, folds = cv10f, papply = mclapply,mc.cores=5) # l2 (ridge) cvfit.11 <- cvrisk(m.l1, folds = cv10f, papply = mclapply,mc.cores=5) # l1 (component-wise) cvfit.MLE <- lapply(1:10,function(k){ # insample weights cv.wts<-cv10f[,k] # run cv glm mle<-glm(y~.,data=alldata,weights=cv.wts) # loss function (y-f)^2 sum(((y-mle$fitted)*!cv.wts)^2)/sum(!cv.wts) }) # Expected Error (according to optimal boosting step per model) cvError = c( "MLE" = mean(unlist(cvfit.MLE)), "Shrinkage Model (Copas)" = min(colMeans(cvfit.shrink)), "l2 (Ridge)" = min(colMeans(cvfit.12)), "l1 (Component-wise)" = min(colMeans(cvfit.11)) ) # plot the CV-errors par(mar=c(11,5,2,0),mgp=c(4,1,0)) barplot(cvError,ylim=range(pretty(cvError)),las=2,ylab="Expected Error (10-fold CV)", main = "Component-wise boosting wins, with lowest predictive error\n (\u21131-regularization) ",xpd=FALSE,col="pink")

Comparison of prediction error from cross-validation, according to four different types of estimators. Lower prediction error is better. |

Why this order of predictive performance? From worse to best:

- Maximum likelihood model has no shrinkage, no feature learning
- The shrinkage model has shrinkage, but no feature learning
- The
*l*2 model has shrinkage, and some weak feature learner - The
*l*1 model (component-wise boosting) has shrinkage and does automatic feature learning, successfully setting some unimportant covariates to zero

## PART 4: Example of Functional Gradient Descent IV: Boosted Regression Trees.

In the next example, our learner will be a single (weak) decision trees from the package rpart (which is a CART model). Despite this wildly different learner, the FGD algorithm is the same: we sequentially fit the learner to the negative gradient (residuals1). The only thing that has changed is the learner: a CART model with small maximum depth vs penalized least-squares.

We will compare this boosted regression tree model to the FAMOUS and POWERFUL XGBoost! (go-to solution for many Kaggle competitions, and rightly so). We start by defining our learner:

library(party) TreeLearner = function(X, control = ctree_control(maxdepth=2,mtry=round(0.7 * ncol(X)),mincriterion=0), subsample = 1){ mf <- as.data.frame(X) n <- nrow(mf) if(subsample <1){ get_weights <- function() tabulate(sample(1:n,size=round(n*subsample)),n) } else { get_weights <- function() rep(1,n) } # fit a PARTY model (conditional inference tree) fit = function(y){ # fit a tree model btreemodel <-ctree(y ~ ., data = data.frame(cbind(y=y,mf)), weights = get_weights(), controls=control) # return the predictions predict <- function(newdata=NULL) btreemodel@predict_response(newdata) return(list( beta=btreemodel, predict = predict )) } return(list(fit = fit)) }

#### Key Features:

- one CART learner, with hyperparameters: maxdepth (2), minbucket (7), subsampling rate 0.66
- for each boosting iteration, the learner gets a random subsample of rows of data (like bagging!)
- for each boosting iteration, the learner gets a random subsample of columns of features (like random forest!)
- boosting hyperparameters: the number of boosting steps is 200 and the learning-rate is 0.02.

# initialize the fit-vector y_mean <- mean(y) F <- rep(y_mean,n) # initialize the gradient negative_gradient <- function(y,F){ y-F } r <- negative_gradient(y,F) # initialize the learner learners <- TreeLearner(xdata,subsample=0.66) # initialize the ensemble mstop <- 200 nu <- 0.02 ensemble <- vector(mode = "list", length = mstop) # tree boosting for(m in 1:mstop){ g <- learners$fit(r) # update the Fit-vector F <- F + nu * g$predict() # recalculate the gradient r <- negative_gradient(y,F) # add the learner to the ensemble ensemble[[m]] <- g } # fit an XGBoost model library(xgboost) xgmod <- xgboost(data = as.matrix(xdata), label = y, max_depth = 2, # maximum tree depth eta = nu, # shrinkage rate nthread = 2, nrounds = mstop, objective = "reg:linear", subsample=0.66, colsample_bytree = 0.7, min_child_weight = 7 # weights needed in each child to allows splitting ) # make marginal plots: compare XGBOost and our manual boosting marginal_data <- as.data.frame(apply(xdata,2,function(x){rep(median(x),50)})) par(mfrow=c(4,3)) for(k in names(xdata)){ # set all other convariates to their mean (0) (other than k) marginal_tmp <- marginal_data # predict over the range of the target covariate marginal_tmp[,k] <- seq(min(xdata[,k]),max(xdata[,k]),length.out = 50) # get my predictions f <- rep(y_mean,50) for(g in 1:length(ensemble)){ # update f f <- f + nu*ensemble[[g]]$predict(newdata=marginal_tmp) } # get predictions from XGboost f.xgboost<-predict(xgmod,newdata=as.matrix(marginal_tmp)) # plot my boosting function plot(marginal_tmp[,k], f, type="l",main = k, col = "red",ylim = range(f,f.xgboost), lwd=2) # add line for xgboost (blue) lines(marginal_tmp[,k], f.xgboost, col ="blue", lwd=2) # legend legend(x="bottomright",legend=c("xgboost","manual"), col=c("blue","red"),lwd=c(2,2), bty="n") } # compare the two algorithms plot(F,predict(xgmod,newdata=as.matrix(xdata)),ylab="xgboost",xlab="manual boost",main="XGBoost vs. MyTreeBoost");abline(0,1)

My manual version of boosted regression trees vs. XGBoost. Marginal plots. |

#### Key Takeaways:

- boosted regression trees is just a type of boosting (like we've done above) but with a CART learner
- XGBoost is just very similar type of FGD but with a custom (and computationally efficient, powerful) tree model (not CART).
- the tree model allows non-linear and complex interactions, making it potentially very powerful

## PART 5: Real World Examples: Boosting in R with Package mboost.

The above manual boosting algorithms were just for illustrative purposes of the more universal features of boosting: weak learners, functional gradient descent, shrinkage, sparsity (for l1-regularization and boosted regression trees)

For a real data analysis, one would use the powerful package mboost. Here are a variety of learners available to you to mix and match in mboost.

# least squares learners bols(x1) bols(x1,by=x2) # interaction # splines bbs(x1) # bbs(x1,by=x2) # bivariate spline # tree learners btree(x1,x2) # # radial basis function brad(x1,x2) # random effect brandom(x1) # spatial smooth bspatial(x1,x2) # markov random field bmrf(x1)

**Example: Component-wise boosting with linear base-learners**(e.g., like a Lasso solution)

# LASSO-like regularization library(mboost) m.lasso <- mboost(formula = y ~ bols(age)+bols(sex)+bols(bmi)+bols(map)+bols(tc)+bols(ldl)+bols(hdl)+bols(tch)+bols(ltg)+bols(glu), data = xdata, family = Gaussian(), # controls the gradient and loss function control = boost_control( mstop = 200, # number nu = 0.03) # learning rate ) coef(m.lasso) summary(m.lasso) plot(m.lasso)

Here is how we can specify possible interactions (with sex).

library(mboost) m.interaction <- mboost(formula = y ~ bols(age,by=sex)+bols(sex)+bols(bmi,by=sex)+bols(map,by=sex)+bols(tc,by=sex)+bols(ldl,by=sex)+bols(hdl,by=sex)+bols(tch,by=sex)+bols(ltg,by=sex)+bols(glu,by=sex), data = xdata, family = Gaussian(), # controls the gradient and loss function control = boost_control( mstop = 200, # number nu = 0.03) # learning rate ) coef(m.interaction) summary(m.interaction) plot(m.interaction)

**Example: Generalized Additive Model (GAM) with splines.**

# SPLINES: GAM xdata2 <- as.data.frame(cbind(interc=1, xdata)) # and a column vector of 1's for the intercept m.gam <- mboost(formula = y ~ bbs(age,df=4)+bbs(bmi,df=4)+bbs(map,df=4)+bbs(tc,df=4)+bbs(ldl,df=4)+bbs(hdl,df=4)+bbs(tch,df=4)+bbs(ltg,df=4)+bbs(glu,df=4)+bols(sex,df=4,intercept=FALSE)+bols(interc,intercept=FALSE), data = xdata2, family = Gaussian(), # controls the gradient and loss function control = boost_control( mstop = 150, # number nu = 0.02) # learning rate ) par(mfrow=c(3,2)) plot(m.gam)

Here is how to do a boosted regression trees with a special tree model called conditional inference trees (e.g., somewhat like XGBoost).

# boosted regression trees m.tree <- mboost(formula = y ~ btree(age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu) , data = xdata2, family = Gaussian(), # controls the gradient and loss function control = boost_control( mstop = 200, # number nu = 0.02) # learning rate ) par(mfrow=c(4,3)) plot(m.tree)

## PART 6: "Hypothesis Testing" with Boosting.

Everything we've discussed so far has pertained to

**estimation**and

**prediction**. Aside from the Lasso's sparse selection of features, we haven't made inferences about which covariates are "truly" important or significant.

Hypothesis testing is a massive topic. Every reader will come to this blog with there own idea (misconceptions) about what constitutes hypothesis-testing. Here are four different frameworks that are (crudely) about finding a sparse set of "important" covariates.

**Significance Testing**: from R.A. Fisher: uses the p-value to find evidence against a favoured hypothesis.**Hypothesis Testing**: from Jerzy Neyman and Egon Pearson: capping the long-run rate of Type-I errors at some low threshold (falsely rejecting a null-hypothesis), while trying to minimize the long-run rate of Type-II errors (falsely rejecting the alternative hypothesis).**Consistent selection**: ability to estimate the coefficients as if the "true model" were known in advanced (e.g., Adaptive Lasso, TwinBoosting), aka Oracle Estimators.**Bayesian Model Selection**: select the model (and covariates) with highest posterior probability (e.g., with Bayes Factors).

Most people have heard of Fisher's p-value or NP's Hypothesis-testing. Machine Learning practitioners are generally obsessed with

*sparsity*and the

*consistency*property in high-dimensional inference. This can also be thought of as a type of

*l*0 regularization: we are trying to shrink all-unimportant covariates' coefficients to zero, and we want the important covariates to have unbiased estimates (unlike l1 or l2 regularization). Such an estimator is called consistent.

Vanilla boosting has no ability to do hypothesis-inference (no p-values, no guarantees of capping the Type-I errors, no guarantee aboiut consistency, no Bayes Factors). In fact, a component-wise boosting algorithm tuned via cross-validation will tend to have higher

**false-positives**that consistent estimators

However, there are many clever tweaks to the algorithm to get some version of hypothesis-based inference. Two popular choices are:

- boosting with stability selection to get a sparse consistent estimator;
- boosting with stability selection to derive Bayesian posterior inclusion probabilities;
- TwinBoosting to get a consistent estimator like the Adaptive Lasso (10.1007/s11222-009-9148-5).

Let's try Twinboosting on the Diabetes dataset. Twinboosting uses two rounds of boosting. The first round is component-wise boosting, and provides initial estimates of the coefficients in round 2. In round 2, the initial estimates act as a scaling to increasing the selection of the high-weight covariates, and decreasing the selection of low-importance covariates. Amazingly, this leads to a consistent estimator. This type of inference is the following:

- twinboosting is consistent with large n and a sparse feature set (most covariates have no effect are shrunk to 0);
- when TwinBoosting selects covariates X, we keep X as our
*best current candidate for the truth*(i.e, a Bayesian interpretation).

Description of TreeBoosting from original Paper |

We can also compare the coefficient weights of the l1-regularization (component-wise boosting / Lasso) vs. the coefficients of the l0-regularization (TwinBoosting), as in the Figure below.

# Twinboosting library(bst) # twinboosting requires two rounds of boosting (much like the adaptive lasso) # Round One: Component-wise boosting cv.round1 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=FALSE, mstop=700),n.cores = 5) plot(cv.round1$cv.error) # find the best m best_m1 <- which.min(cv.round1$cv.error) # fit the first model mod1 <- bst(x=xdata, y=y, ctrl = bst_control(mstop=best_m1), family = "gaussian", learner = "ls") # 2nd round of boosting: # initialize the coefficients with the values form from round 1 cv.round2 <- cv.bst(x=xdata, y=y, K=10, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=700),n.cores = 5) plot(cv.round2$cv.error) # get the optimal hyperparameter best_m2 <- which.min(cv.round2$cv.error) mod.ls <- bst(x=xdata, y=y, ctrl = bst_control(twinboost=TRUE,coefir=coef(mod1), xselect.init = mod1$xselect, mstop=best_m2)) # plot the results plot(mod.ls) coef(mod.ls) # The model only selects THREE covariates: BMI, LTG, MAP, sex, and HDL matrix(coef(mod.ls),dimnames=list(names(xdata),1)) 1 age 0.00000 sex -13.28290 bmi 589.21010 map 188.90932 tc 0.00000 ldl 0.00000 hdl -87.27599 tch 0.00000 ltg 521.75831 glu 0.00000 # let's compare the coefficient estimates from Lasso solution vs. TwinBoosting coefs <- matrix(c(coef(mod1),coef(mod.ls)), K,2,dimnames = list(names(xdata),c("l1","l0"))) o <- order(abs(coefs[,1]),decreasing=TRUE) coefs <- coefs[o,] jpeg("img/twinboosting_vs_lasso.jpg",width=600,height=600) barplot(t(coefs),beside=TRUE,legend.text=c(expression(hat(beta)*"-\u2113"*1~"(Lasso)"),expression(hat(beta)*"-\u2113"*0~"(TwinBoost)")),main ="TwinBoost vs. Lasso Coefficients\n(Diabetes Data)") dev.off()

TwinBoosting coefficients vs. Component-wise Boosting coefficients. (Sparse vs. shrunk). |

#### Take Home Message:

- TwinBoosting yields a sparser set of coefficients than
*l*1-regularization (more coefficients are exactly 0) - The less-important covariates are smaller compared to
*l*1-coefficients - The more-important covariates are larger compared to the
*l*1-coefficients (remember, we know the*l*1-coefficients are slightly biased low).

For more information about other ways of doing "Hypothesis Testing" with boosting, see the high-dimensional p-values 10.1198/jasa.2009.tm08647, or high-dimensional error-control of Type-I errors, or posterior inclusion probabilities based on stability selection.

## Summary

- Boosting is a statistical modelling technique which uses
*functional gradient descent*to minimize a cost function while sequentially building an ensemble of*shrunken weak learners* - Boosting is a generalization of
*l*2-regularization,*l*1-regularization, and can be tweaked (like in TwinBoosting) to get a consistent sparse estimator.