Bagging and boosting

Author

Ryan Giordano

Goals

Introduce the first of several black-box modifications of a “weak learner”:

- Introduce bagging
- Intuition for high-variance, low-bias procedures
- The bootstrap
- Averaging over bootstrap samples
- Decorrelating the trees (random forests)

Reading

The primary reading is Gareth et al. (2021) 8.2. The reading in Hastie, Tibshirani, and Friedman (2009) section 8.7 may also be useful. These notes are intended only as a supplement to the primary reading.

Bagging

Suppose we’re using squared error loss, and we have a learner that is approximately unbiased but has high variance. That is, for a particular fixed \(\boldsymbol{x}\), the prediction \(\hat{f}(\boldsymbol{x}; \mathcal{D})\), which depends on the training data \(\mathcal{D}\), satisfies

\[ \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\boldsymbol{x}; \mathcal{D})\right] \approx f^{\star}(\boldsymbol{x}) \quad\textrm{and}\quad \mathrm{Var}_{\mathcal{D}}\left(\hat{f}(\boldsymbol{x}; \mathcal{D})\right) \textrm{ is large.} \]

Here, \(\mathbb{E}\left[y\vert \boldsymbol{x}\right] = f^{\star}(\boldsymbol{x})\) is the best possible prediction we could make.

One might think of deep regression trees as a possible example. Intuitively, one might expect such a predictor to overfit the particular dataset at hand.

Recall that we showed earlier that the expected squared error decomposes into

\[ \mathbb{E}_{\mathcal{D}}\left[(\hat{f}(\boldsymbol{x}; \mathcal{D}) - f^{\star}(\boldsymbol{x}))^2\right] = \underbrace{ \mathbb{E}_{\mathcal{D}}\left[\left(\hat{f}(\boldsymbol{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\boldsymbol{x}; \mathcal{D})\right]\right)^2\right] }_{\textrm{Variance}} + \underbrace{ \left(\mathbb{E}_{\mathcal{D}}\left[\hat{f}(\boldsymbol{x}; \mathcal{D})\right] - f^{\star}(\boldsymbol{x})\right)^2 }_{\textrm{Bias}^2} . \]
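To see this decomposition in action, here is a minimal simulation sketch (the data-generating process, with \(f^{\star}(x) = \sin(4x)\) and noise standard deviation 0.5, is made up for illustration): it refits a deep regression tree on many independent training sets and checks that the expected squared error at a fixed point equals the variance plus the squared bias.

```python
# A minimal simulation sketch (assumed data-generating process, not from the
# notes): repeatedly draw training sets D, fit a deep regression tree, and
# check numerically that MSE = Variance + Bias^2 at a fixed point x0.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(4 * x)     # assumed "true" regression function
x0 = np.array([[0.5]])               # fixed evaluation point
f0 = f_star(0.5)                     # f_star at x0

preds = []
for _ in range(500):                 # 500 independent training sets D
    x = rng.uniform(0, 1, size=(100, 1))
    y = f_star(x[:, 0]) + rng.normal(scale=0.5, size=100)
    # A fully grown tree: approximately unbiased, but high variance.
    preds.append(DecisionTreeRegressor().fit(x, y).predict(x0)[0])

preds = np.array(preds)
mse = np.mean((preds - f0) ** 2)
variance = np.var(preds)
bias_sq = (np.mean(preds) - f0) ** 2
print(mse, variance + bias_sq)       # the two numbers agree (up to rounding)
```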

The variance only increases the expected squared error, so if we could get a lower–variance estimator without changing the bias, we’d improve our loss. Ideally, we might imagine getting \(B\) different, new datasets \(\mathcal{D}_1, \ldots, \mathcal{D}_B\), and then using

\[ \hat{f}_{Ideal}(\boldsymbol{x}) = \frac{1}{B} \sum_{b=1}^B \hat{f}(\boldsymbol{x}; \mathcal{D}_b) \approx \mathbb{E}_{\mathcal{D}}\left[\hat{f}(\boldsymbol{x}; \mathcal{D})\right]. \]

Of course, if we could actually do this, we should just combine the datasets into one giant dataset. But this idea motivates the feasible “bagging” estimator:

\[ \hat{f}_{Bagging}(\boldsymbol{x}) = \frac{1}{B} \sum_{b=1}^B \hat{f}(\boldsymbol{x}; \mathcal{D}_b^*), \]

where \(\mathcal{D}_b^*\) is a bootstrap sample from the original dataset: \(N\) points drawn with replacement from \(\mathcal{D}\). Conditional on the original dataset \(\mathcal{D}\), each bootstrap draw \((\boldsymbol{x}^*_m, y^*_m)\) has the distribution

\[ \mathbb{P}\left((\boldsymbol{x}^*_m, y^*_m) = (\boldsymbol{x}_n, y_n)\right) = \frac{1}{N} \quad\textrm{for }\quad n=1,\ldots,N, \]

independently across the draws \(m = 1, \ldots, N\).
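As a concrete illustration, here is a minimal sketch of the bagging estimator, assuming deep regression trees (scikit-learn’s `DecisionTreeRegressor`) as the high-variance base learner; the function names and the default \(B\) are illustrative choices, not part of the notes.

```python
# A minimal sketch of the bagging estimator above, assuming deep regression
# trees as the high-variance base learner; the names fit_bagged /
# predict_bagged and the default B are illustrative, not from the notes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_bagged(x, y, B=200, seed=0):
    """Fit B deep trees, each on a bootstrap sample D*_b of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # sample n rows with replacement (each w.p. 1/N)
        trees.append(DecisionTreeRegressor().fit(x[idx], y[idx]))
    return trees

def predict_bagged(trees, x_new):
    """f_hat_Bagging(x) = (1/B) * sum_b f_hat(x; D*_b): average the tree predictions."""
    return np.mean([tree.predict(x_new) for tree in trees], axis=0)
```

Given training arrays `x` (of shape \(N \times P\)) and `y`, `predict_bagged(fit_bagged(x, y), x_new)` then computes \(\hat{f}_{Bagging}\) at each row of `x_new`.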

The problem, of course, is that the bootstrap datasets \(\mathcal{D}_b^*\) are resamples of \(\mathcal{D}\), not new independent datasets. Why, then, does bagging work? In my opinion the theory is not very clear, but in practice bagging tends to improve highly variable, expressive estimators, while it can make less variable estimators worse.

Linear estimators are unaffected by bagging

To make concrete the fact that bagging does not necessarily produce an improvement, consider the sample mean \(\hat{\mu}= \frac{1}{N} \sum_{n=1}^N y_n\) as an estimator of \(\mathbb{E}\left[y\right]\). For a particular bootstrap sample,

\[ \hat{\mu}^*_b := \frac{1}{N} \sum_{n=1}^Ny_n^*, \]

and the bagged estimator is

\[ \begin{aligned} \frac{1}{B} \sum_{b=1}^B \hat{\mu}^*_b \approx{}& \mathbb{E}_{\mathcal{D}^*}\left[\hat{\mu}^*_b\right] \\={}& \mathbb{E}_{\mathcal{D}^*}\left[\frac{1}{N} \sum_{n=1}^Ny_n^*\right] \\={}& \frac{1}{N} \sum_{n=1}^N\mathbb{E}_{\mathcal{D}^*}\left[y_n^*\right]. \end{aligned} \]

Now, under the bootstrap distribution,

\[ \mathbb{E}_{\mathcal{D}^*}\left[y_n^*\right] = \sum_{m=1}^N y_m \, \mathbb{P}\left(y_n^* = y_m\right) = \frac{1}{N} \sum_{m=1}^N y_m = \hat{\mu}. \]

So the limiting bagging estimator is

\[ \mathbb{E}_{\mathcal{D}^*}\left[\hat{\mu}^*_b\right] = \hat{\mu}. \]

This is nothing more than our original estimator! Bagging has done nothing other than increase our variance (by using \(B < \infty\) bootstrap samples).

The fact that bagging only makes things worse in this case has nothing to do with the bias or variance of \(\hat{\mu}\); it depends only on the fact that \(\hat{\mu}\) is a linear function of sample moments of the data. It follows that bagging can only have an effect on estimators that are nonlinear functions of the sample moments of the data. Linear regression is only weakly nonlinear (through the nonlinearity of \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\)), and is mostly not improved by bagging. But regression trees, and hence random forests, are much more highly nonlinear through the data-dependent selection of the tree structure.
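As a quick numerical check of the linearity argument (on made-up data), averaging the bootstrap sample means simply recovers the original sample mean, up to Monte Carlo noise:

```python
# A quick numerical check of the linearity argument, on made-up data: the
# average of bootstrap sample means is (up to Monte Carlo noise) just the
# original sample mean.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)
mu_hat = y.mean()

B = 2000
boot_means = np.array([rng.choice(y, size=len(y), replace=True).mean()
                       for _ in range(B)])
print(mu_hat, boot_means.mean())   # essentially identical; bagging adds nothing here
```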

Random forests

If the bagged models are highly correlated with one another, we don’t gain much: the variance of an average of \(B\) identically distributed estimators with common variance \(\sigma^2\) and pairwise correlation \(\rho\) is \(\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2\), which does not shrink to zero as \(B\) grows unless \(\rho\) does. High correlation arises, for example, when there is one strong explanatory variable that always gets chosen for the first split, so that all the bagged trees look alike. Random forests reduce this correlation by allowing each split to consider only a randomly chosen subset of the explanatory variables (often about \(\sqrt{P}\) of the \(P\) available variables).
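As a concrete sketch (assuming scikit-learn’s `RandomForestRegressor`, with illustrative parameter values), ordinary bagged trees correspond to letting every split consider all \(P\) variables, while a random forest restricts each split to a random subset:

```python
# A sketch contrasting plain bagged trees with a random forest, assuming
# scikit-learn's RandomForestRegressor; the parameter values are illustrative.
from sklearn.ensemble import RandomForestRegressor

# max_features=1.0: every split may use all P variables -- ordinary bagging.
bagged_trees = RandomForestRegressor(n_estimators=500, max_features=1.0)

# max_features="sqrt": each split considers a random subset of about sqrt(P)
# variables, decorrelating the trees.
random_forest = RandomForestRegressor(n_estimators=500, max_features="sqrt")

# bagged_trees.fit(x, y); random_forest.fit(x, y)   # x, y as in the earlier sketches
```

Smaller `max_features` makes the individual trees weaker but less correlated with one another, which is often a good trade for the bagged average.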

References

Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.