Bagging and boosting

Author

Ryan Giordano

Goals

Introduce the second black–box modification of a “weak learner”:

- Introduce boosting
- Intuition for high–bias, low–variance procedures
- Generalized additive models (random forests as a special case)
- Greedy forward optimization and boosting

Reading

The primary reading is Gareth et al. (2021) 8.2. The reading in Hastie, Tibshirani, and Friedman (2009) sections 8.7 and 10.1–10.4 (GAMs and AdaBoost), and especially 10.9–10.10, is also useful. These notes are intended only as a supplement to the primary reading.

Additive models

In the last lecture, we introduced random forest estimators, which took the form

\[ \hat{f}_{RF}(\boldsymbol{x}) = \sum_{m=1}^M \frac{1}{M} \hat{f}_m(\boldsymbol{x}; T_m), \]

where each \(\hat{f}_m(\cdot; T_m)\) was a regression or classification tree, parameterized by its topology and leaf nodes \(T_m\) and trained on a bootstrapped sample of the data using a relatively small set of predictors to decorrelate the trees.

Recall also that a linear model takes the form

\[ \hat{f}_{Lin}(\boldsymbol{x}) = \sum_{m=1}^M \beta_m x_m. \]

Formally, we can think of the random forest as a linear model with (sub–optimal) coefficients \(\beta_m = 1/M\), and nonlinearly–transformed regressors given by the trees \(\hat{f}_m(\boldsymbol{x}; T_m)\), which are learned (greedily, sub–optimally) on the bootstrapped data.

The similarity motivates a general class of models called “generalized additive models” (Hastie, Tibshirani, and Friedman (2009) 9.1) which take the form

\[ \hat{f}_{GAM}(\boldsymbol{x}) = \sum_{m=1}^M \theta_m \phi_m(\boldsymbol{x}), \]

which we might try to optimize by minimizing the empirical loss

\[ \hat{\theta}_1,\ldots,\hat{\theta}_M,\hat{\phi}_1,\ldots,\hat{\phi}_M := \underset{\theta_1,\ldots,\theta_M,\,\phi_1,\ldots,\phi_M}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\mathscr{L}\left(\sum_{m=1}^M \theta_m \phi_m(\boldsymbol{x}_n), y_n\right), \]

where each \(\phi_m(\cdot)\) is in some constrained class (otherwise this function class could drive the training loss to zero, and might be expected to generalize poorly). For example, we might take \(\phi_m(\cdot) = \phi(\cdot; T_m)\) to be a regression tree of fixed size, or one that minimizes a loss that regularizes tree size, leaf weights, or both.

Classification trees

Some extra care needs to be taken when using GAMs for classification. If we have a GAM of the form

\[ \hat{f}_{GAM}(\boldsymbol{x}) = \sum_{m=1}^M \hat{\theta}_m \hat{\phi}_m(\boldsymbol{x}), \]

there are several ways to turn this into a classifier. First of all, each \(\hat{\phi}_m(\boldsymbol{x})\) can be either a classifier, with \(\hat{\phi}_m(\boldsymbol{x}) \in \{0,1\}\), or a probability prediction, with \(\hat{\phi}_m(\boldsymbol{x}) \in [0,1]\). One option that makes sense in either case is to take the majority vote:

\[ \hat{y}_{GAM}(\boldsymbol{x}) = \mathrm{I}\left(\frac{1}{M} \sum_{m=1}^M \hat{\theta}_m \hat{\phi}_m(\boldsymbol{x}) > 0.5\right). \]

In the case that \(\hat{\phi}_m(\boldsymbol{x}) \in \{0,1\}\), we could also take the majority vote after thresholding each term:

\[ \hat{y}_{GAM}(\boldsymbol{x}) = \mathrm{I}\left(\frac{1}{M} \sum_{m=1}^M \mathrm{I}\left(\hat{\theta}_m \hat{\phi}_m(\boldsymbol{x}) > 0.5\right) > 0.5\right). \]

In general, these can give different answers.
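To see this concretely, here is a minimal numpy sketch with hypothetical weights and votes, chosen only to produce a disagreement:

```python
import numpy as np

# Hypothetical learned weights and binary votes for a single input x.
theta = np.array([1.2, 0.3, 0.3])  # illustrative values, not fit to data
phi = np.array([1.0, 1.0, 1.0])    # each phi_m(x) in {0, 1}

# Rule 1: threshold the weighted average.
y_hat_1 = int(np.mean(theta * phi) > 0.5)         # mean is 0.6 -> 1

# Rule 2: threshold each term, then take the majority vote.
y_hat_2 = int(np.mean(theta * phi > 0.5) > 0.5)   # mean is 1/3 -> 0

print(y_hat_1, y_hat_2)  # prints "1 0": the two rules disagree
```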

Boosting as greedy forward stagewise selection

Depending on the structure of the \(\phi_m\), the GAM optimization problem may be intractable. (Hastie, Tibshirani, and Friedman (2009) 9.1 discusses some approaches that minimize the GAM loss iteratively for non–tree structures on \(\phi\).) However, we can take a “leaf” from the CART playbook and perform forward stagewise greedy minimization of the GAM loss by fitting the \(\phi_m\) incrementally, one at a time.

Hastie, Tibshirani, and Friedman (2009) Algorithm 10.2:
  1. Initialize \(\hat{f}_0(\cdot) = 0\).
  2. For \(m=1,\ldots,M\):
    1. Compute \(\hat{\theta}_m, \hat{\phi}_m := \underset{\theta, \phi}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \theta \phi(\boldsymbol{x}_n), y_n)\)
    2. Set \(\hat{f}_{m}(\cdot) = \hat{f}_{m-1}(\cdot) + \hat{\theta}_m \hat{\phi}_m(\cdot)\)
  3. Return \(\hat{f}_M(\cdot) = \sum_{m=1}^M \hat{\theta}_m \hat{\phi}_m(\cdot)\)
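When the inner argmin has no closed form, it can still be carried out by brute force for very simple basis functions. The following Python sketch (our own illustrative code, not a library API) implements the algorithm above with stumps \(\phi(\boldsymbol{x}; j, s) = \mathrm{I}(x_j > s)\), searching exhaustively over splits and using a one–dimensional line search for \(\theta\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def stump(X, j, s):
    """A one-split 'stump' basis function: phi(x) = I(x_j > s)."""
    return (X[:, j] > s).astype(float)

def forward_stagewise(X, y, loss, M=10):
    """Greedy forward stagewise fitting (HTF Algorithm 10.2) with stumps.

    `loss(f, y)` returns per-observation losses.  The inner argmin is done
    by brute force over stumps, with a 1-d line search over theta.  This
    is a deliberately simple sketch, not an efficient implementation."""
    N, D = X.shape
    f = np.zeros(N)                       # f_hat_0(.) = 0
    terms = []
    for m in range(M):
        best = None
        for j in range(D):
            for s in np.unique(X[:, j])[:-1]:     # candidate split points
                phi = stump(X, j, s)
                res = minimize_scalar(lambda t: np.mean(loss(f + t * phi, y)))
                if best is None or res.fun < best[0]:
                    best = (res.fun, res.x, j, s)
        _, theta, j, s = best
        f += theta * stump(X, j, s)       # f_hat_m = f_hat_{m-1} + theta * phi
        terms.append((theta, j, s))
    return terms

# Toy usage with squared error loss.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=100)
terms = forward_stagewise(X, y, loss=lambda f, y: (f - y) ** 2)
```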

Boosting as an improvement to weak learners

For boosting, we first find the \(\hat{f}_{1}(\cdot)\) that best predicts \(y_n\), then the term that makes the biggest improvement when added to \(\hat{f}_{1}(\cdot)\), and so on. Note that at each step all we need to do is make a small improvement, since over \(M\) steps these small improvements accumulate. For example, we might take \(\phi_m(\cdot)\) to simply be a “stump”: a regression tree with a single split. In this sense, boosting can improve on high–bias, low–variance estimators by adding up a large number of them in an increasingly expressive way.

Different loss functions and classification trees

For squared error loss, this is the same as incrementally fitting the residuals, since

\[ \begin{aligned} \mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \theta \phi(\boldsymbol{x}_n), y_n) ={}& \left(\hat{f}_{m-1}(\boldsymbol{x}_n) + \theta \phi(\boldsymbol{x}_n) - y_n \right)^2 \\={}& \left(\theta \phi(\boldsymbol{x}_n) - (y_n - \hat{f}_{m-1}(\boldsymbol{x}_n)) \right)^2 \\={}& \mathscr{L}(\theta \phi(\boldsymbol{x}_n), y_n - \hat{f}_{m-1}(\boldsymbol{x}_n)). \end{aligned} \]
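Concretely, each new term can therefore be fit by least squares on the current residuals. Below is a minimal sketch using scikit-learn depth–one regression trees as the \(\phi_m\). Since a least–squares tree fit to the residuals already has optimal leaf values, the optimal \(\theta_m\) is \(1\); in practice one often substitutes a small shrinkage factor, as in Gareth et al. (2021) 8.2:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_residuals(X, y, M=200, shrink=0.1):
    """Least-squares boosting: fit each stump to the current residuals."""
    f = np.zeros(len(y))
    stumps = []
    for m in range(M):
        resid = y - f                          # y_n - f_hat_{m-1}(x_n)
        tree = DecisionTreeRegressor(max_depth=1).fit(X, resid)
        f += shrink * tree.predict(X)          # shrunken greedy update
        stumps.append(tree)
    return stumps

def predict_boosted(stumps, X, shrink=0.1):
    return shrink * sum(tree.predict(X) for tree in stumps)
```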

However, with more complicated loss functions, the inner problem

\[ \hat{\theta}_m, \hat{\phi}_m := \underset{\theta, \phi}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \theta \phi(\boldsymbol{x}_n), y_n) \]

may not be easy.

For example, consider classification trees decided by majority vote. There is no obvious notion of “residual” that you can train the next classification tree on. It happens that, with a particular form of exponential loss \(\mathscr{L}(f(\boldsymbol{x}), y) = \exp((1 - 2y) f(\boldsymbol{x}))\), the inner problem can be solved directly using an early algorithm known as “AdaBoost” after “Adaptive Boosting” (Hastie, Tibshirani, and Friedman (2009) 10.4). However, for more general losses, an alternative formulation is needed, and a solution is provided by gradient boosting.
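To see why this exponential loss is natural for classification, write \(\tilde{y}:= 2y - 1 \in \{-1, +1\}\), so that

\[ \mathscr{L}(f(\boldsymbol{x}), y) = \exp\left(-\tilde{y}\,f(\boldsymbol{x})\right), \]

which is small when \(f(\boldsymbol{x})\) agrees in sign with \(\tilde{y}\) and grows exponentially when it does not.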

Gradient boosting

\[ \def\gv{\boldsymbol{g}} \def\deltahat{\hat{\delta}} \def\deltav{\boldsymbol{\delta}} \def\deltavhat{\hat{\deltav}} \]

Consider the slightly modified problem of finding a vector \(\deltavhat = (\deltahat_1, \ldots, \deltahat_N)\) that improves our loss:

\[ \deltavhat := \underset{\deltav}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \delta_n, y_n). \]

A gradient descent step that improves the loss would take the form

\[ \deltahat_n := -\varepsilon g_n \quad\textrm{for some small } \varepsilon > 0, \quad\textrm{where}\quad g_n = \left. \frac{\partial \mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \delta_n, y_n)}{\partial \delta_n} \right|_{\delta_n = 0} = \left. \frac{\partial \mathscr{L}(f, y_n)}{\partial f} \right|_{f= \hat{f}_{m-1}(\boldsymbol{x}_n)}. \]

Taking an actual step in this direction has a major problem: it tells us nothing about what to do at values of \(\boldsymbol{x}\) that do not occur in the dataset. So we instead find a function in some constrained class, defined on the whole \(\boldsymbol{x}\) domain, that best aligns with the gradient in squared loss:

\[ \hat{\phi}_m := \underset{\phi}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\left(\phi(\boldsymbol{x}_n) - g_n\right)^2. \]

We then take an optimally–sized step in the direction \(\hat{\phi}_m\) (since \(\hat{\phi}_m\) was fit to the gradient rather than its negative, the optimal \(\hat{\theta}_m\) will typically be negative):

\[ \hat{\theta}_m := \underset{\theta}{\mathrm{argmin}}\, \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}_{m-1}(\boldsymbol{x}_n) + \theta \hat{\phi}_m(\boldsymbol{x}_n), y_n), \]

and set

\[ \hat{f}_m(\cdot) = \hat{f}_{m-1}(\cdot) + \hat{\theta}_m \hat{\phi}_m(\cdot). \]
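As a check, for squared error loss we have

\[ g_n = \left. \frac{\partial (f - y_n)^2}{\partial f} \right|_{f = \hat{f}_{m-1}(\boldsymbol{x}_n)} = 2\left(\hat{f}_{m-1}(\boldsymbol{x}_n) - y_n\right), \]

so \(-g_n\) is proportional to the residual \(y_n - \hat{f}_{m-1}(\boldsymbol{x}_n)\), and gradient boosting recovers the residual fitting described above.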

This is a forward stagewise greedy approach that applies to any differentiable loss function.
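Putting the pieces together, here is a minimal sketch of gradient boosting with stump base learners for a user–supplied differentiable loss; the function names are our own, not a standard library API. As an example, it is applied with the exponential loss discussed above:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, grad, loss, M=100):
    """Gradient boosting with stumps: fit each stump to the gradient g_n,
    then choose the step size theta_m by a one-dimensional line search."""
    f = np.zeros(len(y))
    model = []
    for m in range(M):
        g = grad(f, y)                                       # g_n at f_hat_{m-1}
        phi = DecisionTreeRegressor(max_depth=1).fit(X, g)   # align with gradient
        phi_x = phi.predict(X)
        theta = minimize_scalar(
            lambda t: np.mean(loss(f + t * phi_x, y))).x     # optimal (usually
        f += theta * phi_x                                   # negative) step
        model.append((theta, phi))
    return model

# The exponential loss of the AdaBoost discussion and its derivative in f.
exp_loss = lambda f, y: np.exp((1 - 2 * y) * f)
exp_grad = lambda f, y: (1 - 2 * y) * np.exp((1 - 2 * y) * f)

# Toy binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
model = gradient_boost(X, y, exp_grad, exp_loss, M=20)
```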

References

Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.