Homework 4 (due April 19th 9pm)
Stat 154/254: Statistical Machine Learning
1 Draw your own partitions (based on Gareth et al. (2021) 8.4.1)
(a)
Draw an example of your own invention of a partition of a two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions, the cutpoints, and which region corresponds to which leaf of the tree.
(b)
Draw an example of your own invention of a partition of a two-dimensional feature space that could not result from recursive binary splitting.
2 Aggregating bagged classifiers (based on Gareth et al. (2021) 8.4.5)
Suppose we fit a classification tree to each of ten bootstrapped datasets from a two-class (red/green) problem. For a particular test point, the ten trees produce the following estimates of the probability of the class red:
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two ways to combine these ten probability estimates into a single classification:
- The majority vote: each tree predicts red if and only if its estimated probability of red is at least 0.5, and the most common prediction wins.
- Averaging the probabilities, and predicting red if and only if the average probability of red is at least 0.5.
In this example, what is the final classification under these two approaches? Explain any discrepancy in intuitive terms; a single sentence is enough.
Four of the individual classifiers vote green and six vote red, so the majority vote is red.
The average probability is 0.45, so the average probability approach chooses green.
The difference is due to the fact that, although there are fewer green votes, their probability estimates are more extreme (much further from 0.5), which pulls the average below 0.5.
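As a quick numerical check of these two rules (a minimal sketch; the 0.5 thresholds are the standard conventions assumed above):

```python
# Check the two aggregation rules for the ten bootstrap estimates of P(red | X).
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree predicts red when its estimate is at least 0.5.
red_votes = sum(p >= 0.5 for p in probs)
majority_vote = "red" if red_votes > len(probs) / 2 else "green"

# Average probability: predict red when the mean estimate is at least 0.5.
avg = sum(probs) / len(probs)
average_rule = "red" if avg >= 0.5 else "green"

print(red_votes, majority_vote)  # 6 red votes -> "red"
print(avg, average_rule)         # average 0.45 -> "green"
```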
3 Trees as regression on partition indicators
Suppose
Let
(a)
Show that
(b)
Let
(c)
Consider a regression tree topology
Show directly that the optima are
(d)
Show that your result in (c) also follows from (a) and (b). That is, show that the problem of finding the optimal predictions at the leaves of a regression tree is the same as the regression problem in part (b).
(e)
Consider the setting of (c), but for logistic loss. Suppose now that
Show directly that the optima are
(a)
(b)
The
(c)
Each leaf is an independent optimization problem, so
which we know gives
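Written out under an assumed notation (leaf region R_m, constant leaf prediction c_m, squared-error loss), the per-leaf problem and its solution are:
\[
\hat c_m = \arg\min_{c}\sum_{i:\,x_i\in R_m}(y_i-c)^2,
\qquad
0 = -2\sum_{i:\,x_i\in R_m}(y_i-\hat c_m)
\;\Longrightarrow\;
\hat c_m = \frac{1}{n_m}\sum_{i:\,x_i\in R_m} y_i,
\]
where $n_m$ is the number of training points in leaf $R_m$; that is, the optimal prediction in each leaf is the within-leaf mean of the responses.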
(d)
The tree defines a partition
and the latter is the least squares problem from (a) and (b).
(e)
This can be done either by recognizing that this is logistic regression on indicators, or by observing again that
The first order condition gives
from which the result follows.
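Written out under the same assumed notation, with labels $y_i \in \{0,1\}$ and leaf score $c_m$ under the logistic loss, the first-order condition gives
\[
\hat c_m = \arg\min_{c}\sum_{i:\,x_i\in R_m}\Big(\log\big(1+e^{c}\big)-y_i\,c\Big),
\qquad
0 = \sum_{i:\,x_i\in R_m}\big(\sigma(\hat c_m)-y_i\big)
\;\Longrightarrow\;
\sigma(\hat c_m)=\bar y_m,\quad
\hat c_m=\log\frac{\bar y_m}{1-\bar y_m},
\]
where $\sigma$ is the logistic sigmoid and $\bar y_m$ is the fraction of positive labels in the leaf.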
4 Interpretability of regression trees
A commonly claimed advantage of regression trees is their interpretability: it is relatively easy to inspect why a regression tree made the prediction it did. But that does not mean that you can safely use trees to do variable selection for causal inference.
Suppose that you have data that’s generated according to the following process:
Suppose we don’t observe
(a)
Suppose we train a regression tree on this data using
(b)
Suppose you make the first split on
(c)
Suppose you have trained a tree in which
(a)
In the data generating process, the stochastic relationship between
(b)
Since the variance of
(c)
As we’ve seen in (a) and (b), the fact that
5 Monotonic transformations (based on a problem from Erez Buchweitz)
Suppose we are finding the optimal split point for a node
Typically there are a set of
Consider applying CART with the above splitting procedure to regression observations of
(a)
In general, will the training error be different for the two datasets? Why or why not?
(b)
In general, will the test error be different for the two datasets? Why or why not?
(c)
Construct a simple one-dimensional example in which one transformation gives a good test set error and the other does not.
(a)
The training error will be identical. Since the transformation is either strictly increasing or strictly decreasing, it does not change the ordering of the regressors in any leaf, and so the optimal split points will occur between the same pairs of points, irrespective of whether we split on the original regressors or on their transformed values.
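A minimal sketch of this invariance (illustrative only; the cubic map here stands in for an arbitrary strictly increasing transformation):

```python
# Sketch: CART's training fit is invariant to strictly monotone transformations
# of a single regressor, since the candidate splits induce the same partitions
# of the responses; only the numerical thresholds change.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

tree_raw = DecisionTreeRegressor(max_depth=3).fit(x.reshape(-1, 1), y)
tree_cubed = DecisionTreeRegressor(max_depth=3).fit((x ** 3).reshape(-1, 1), y)

# The fitted values on the training data (and hence the training error) agree.
print(np.allclose(tree_raw.predict(x.reshape(-1, 1)),
                  tree_cubed.predict((x ** 3).reshape(-1, 1))))  # True
```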
(b)
As long as
(c)
Suppose that we have two points,
6 The accuracy of bagging (based on Bühlmann and Yu (2002)) (254 only)
In this problem we explore some simple settings in which bagging does and does not work.
(a)
Suppose we observe
Let
Show that
(b)
Next, suppose that we observe pairs
Let
(c)
In the setting of (b), let
Let
Note that the bagged estimator replaced a sharp threshold
Hint: Under the bootstrap distribution, you may use without proof the fact that, for large
(d)
Evaluating the result for (b) and (c) at
Conclude that, for
Hint: use the fact that
(a)
This was done in lecture.
(b)
By normality,
Since
(c)
The reasoning is as in (b) but using the CLT applied to the bootstrap distribution.
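As an illustration only (the exact estimator in the problem statement is not reproduced above, so the hard-threshold estimator 1{sample mean <= c} below is an assumed stand-in), a small simulation shows how the bootstrap CLT turns the sharp threshold into an approximately normal CDF:

```python
# Illustration (assumed setup): bagging the hard-threshold estimator
# 1{sample mean <= c} over bootstrap resamples approximately yields
# Phi((c - xbar) / (s / sqrt(n))), a smooth function of the data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, B, c = 200, 5000, 0.1
x = rng.normal(loc=0.0, scale=1.0, size=n)
xbar, s = x.mean(), x.std(ddof=1)

# Bagged estimator: average the indicator over bootstrap resamples.
boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                       for _ in range(B)])
bagged = np.mean(boot_means <= c)

# Normal approximation suggested by the CLT under the bootstrap distribution.
approx = norm.cdf((c - xbar) / (s / np.sqrt(n)))

print(bagged, approx)  # the two should be close
```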
(d)
Evaluating the result for (b) and (c) at
and
7 Trees as approximations to linear functions (254 only)
Suppose you are trying to use a regression tree to represent the function
(a)
Consider a connected set
Suppose we predict
(b)
Suppose we partition the unit interval into
(c)
In the setting of (b), for a fixed
(d)
Suppose we grow a regression tree with depth
Note: It seems to me that computer scientists love to do accuracy order calculations like this!
(a)
All we really need is the two integrals
We know that
so
A simple way to do the second integral is to note that
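For reference, one version of the kind of integral needed here, under the assumption that the target is a linear function (say f(x) = x) restricted to an interval of length h and approximated by its best constant, is
\[
\min_{c}\int_0^h (x-c)^2\,dx
= \int_0^h\Big(x-\tfrac{h}{2}\Big)^2\,dx
= \frac{h^3}{12},
\]
since the minimizing constant is the average value $c = h/2$.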
(b)
where we used (a) in the second-to-last line.
(c)
By (b) we require
Since
(d)
The optimal split point is always in the middle of the interval, so if we have a tree of depth
By (c) we then require
Again,
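Putting the pieces together under the assumptions above (midpoint splits, so a depth-$d$ tree partitions $[0,1]$ into $2^d$ equal intervals, and target $f(x)=x$),
\[
\int_0^1\big(x-\hat f(x)\big)^2\,dx
= 2^{d}\cdot\frac{(2^{-d})^3}{12}
= \frac{2^{-2d}}{12},
\]
so achieving squared error $\varepsilon$ requires a depth $d$ of order $\tfrac{1}{2}\log_2(1/\varepsilon)$.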