Homework 4 (due April 19th 9pm)

Stat 154/254: Statistical Machine Learning

1 Draw your own partitions (based on Gareth et al. (2021), 8.4.1)

(a)

Draw an example of your own invention of a partition of a two-dimensional feature space that could result from recursive binary splitting. Your example should contain at least six regions. Draw a decision tree corresponding to this partition. Be sure to label all aspects of your figures, including the regions, the cutpoints, and which region corresponds to which leaf of the tree.

(b)

Draw an example of your own invention of a partition of a two-dimensional feature space that could not result from recursive binary splitting.

Solutions

A possible answer

2 Aggregating bagged classifiers (based on Gareth et al. (2021), 8.4.5)

Suppose we fit a classification tree to each of ten bootstrapped datasets containing regressors $x$ and the classes red and green. For a particular observation $x_n$, each tree gives an estimate of $p(\text{Red} \mid x_n)$; the ten estimates are:

0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.

There are two ways to combine these estimates to make a single classification:

  • The majority vote: each tree predicts red if and only if its estimated probability of red is $> 0.5$, and we predict the class that receives the most votes.
  • Averaging the probabilities: we predict red if and only if the average probability of red is $> 0.5$.

In this example, what is the final classification under these two approaches? Explain any discrepancy in intuitive terms; a single sentence is enough.

Solutions

Four of the ten trees predict green and six predict red, so the majority vote is red.

The average probability is 0.45, so the average probability approach chooses green.

The difference arises because, although there are fewer green votes, their probability estimates are more extreme (further from 0.5).
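As a quick sanity check (not part of the required answer), here is a small Python sketch of the two aggregation rules; it uses only the standard library.

```python
# The ten estimates of p(Red | x_n) from the bootstrapped trees.
probs = [0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75]

# Majority vote: each tree predicts red iff its estimate exceeds 0.5.
red_votes = sum(p > 0.5 for p in probs)
majority = "red" if red_votes > len(probs) / 2 else "green"

# Averaged probability: predict red iff the mean estimate exceeds 0.5.
mean_prob = sum(probs) / len(probs)
averaged = "red" if mean_prob > 0.5 else "green"

print(red_votes, majority)   # 6 red votes -> "red"
print(mean_prob, averaged)   # 0.45 -> "green"
```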

3 Trees as regression on partition indicators

Suppose $\{A_1, \ldots, A_K\}$ is a partition of $\mathbb{R}^P$, meaning $A_k \cap A_j = \emptyset$ for $k \neq j$. Let $x_n \in \mathbb{R}^P$, let $z_{nk} = I(x_n \in A_k)$ denote the indicators of the sets $A_k$, and let $z_n = (z_{n1}, \ldots, z_{nK})$. Finally, let $Z$ denote the $N \times K$ matrix with $z_n$ in row $n$, and let $Y = (y_1, \ldots, y_N)$ denote a vector of real-valued responses.

Let $N_k$ denote the number of $x_n$ in partition $A_k$, and let $\bar{y}_k$ denote the average of the responses $y_n$ in $A_k$: $\bar{y}_k = \frac{1}{N_k} \sum_{n: x_n \in A_k} y_n$.

(a)

Show that $Z^\top Z$ is diagonal, with $k$-th diagonal entry equal to $N_k$.

(b)

Let $\hat\beta \in \mathbb{R}^K$ denote the OLS coefficient from regressing $Y \sim Z\beta$. Show that $\hat\beta = (\bar{y}_1, \ldots, \bar{y}_K)$.

(c)

Consider a regression tree topology $T$ with leaves $t$; write $t \in T$ for the set of leaves, and write $n \in t$ for the set of observations $x_n$ that end up in leaf $t$. Given the tree $T$, suppose we predict a constant $c_t$ for observations $x$ that end up in leaf $t$. Consider the problem of finding the optimal leaf predictions,

$$(\hat{c}_1, \ldots, \hat{c}_{|T|}) = \operatorname{argmin}_{c_t,\, t \in T} \; \sum_{t \in T} \sum_{n \in t} (y_n - c_t)^2.$$

Show directly that the optima are $\hat{c}_t = \bar{y}_t$, where $\bar{y}_t$ is the average of the responses $y_n$ in leaf $t$.

(d)

Show that your result in (c) also follows from (a) and (b). That is, show that the problem of finding the optimal predictions at the leaves of a regression tree is the same as the regression problem in part (b).

(e)

Consider the setting of (c), but for logistic loss. Suppose now that $y_n \in \{0, 1\}$, and consider

$$(\hat{c}_1, \ldots, \hat{c}_{|T|}) = \operatorname{argmin}_{c_t,\, t \in T} \; \sum_{t \in T} \sum_{n \in t} \left( -y_n \log c_t - (1 - y_n) \log(1 - c_t) \right).$$

Show directly that the optima are $\hat{c}_t = \hat\pi_t$, where now $\hat\pi_t$ is the proportion of observations with $y_n = 1$ in leaf $t$.

Solutions

(a)

$z_{nk} = 1$ if and only if $x_n \in A_k$, so the $k$-th diagonal entry is $\sum_{n=1}^N z_{nk}^2 = \sum_{n=1}^N z_{nk} = N_k$ (since $z_{nk}^2 = z_{nk}$ for an indicator). For $k \neq k'$, $\sum_{n=1}^N z_{nk} z_{nk'} = 0$, since each $x_n$ can be in $A_k$, in $A_{k'}$, or in neither, but not in both, because the sets $A_k$ are disjoint. It follows that $Z^\top Z$ has the stated form by writing out the matrix multiplication as sums, $(Z^\top Z)_{kk'} = \sum_{n=1}^N z_{nk} z_{nk'}$.

(b)

The $k$-th entry of $Z^\top Y$ is the sum of the $y_n$ with $x_n \in A_k$, and $(Z^\top Z)^{-1}$ is diagonal with $1/N_k$ on the diagonal by (a). The result follows from $\hat\beta = (Z^\top Z)^{-1} Z^\top Y$.
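For readers who like to check such identities numerically, here is a short sketch of (a) and (b), assuming numpy is available; the partition assignment and responses are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 4
labels = rng.integers(0, K, size=N)        # which cell A_k each observation falls in
Z = np.eye(K)[labels]                      # indicator matrix: z_{nk} = I(x_n in A_k)
Y = rng.normal(size=N)

# (a): Z^T Z is diagonal with the cell counts N_k on the diagonal.
print(Z.T @ Z)
print(np.bincount(labels, minlength=K))

# (b): the OLS coefficients equal the within-cell means ybar_k.
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
cell_means = np.array([Y[labels == k].mean() for k in range(K)])
print(np.allclose(beta_hat, cell_means))   # True
```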

(c)

Each leaf is an independent optimization problem, so

$$\hat{c}_t = \operatorname{argmin}_{c_t} \sum_{n \in t} (y_n - c_t)^2,$$

which we know gives $\hat{c}_t = \bar{y}_t$, the average of the $y_n$ in leaf $t$.

(d)

The tree defines a partition $\{A_t\}$, where $x_n \in A_t$ if and only if $n \in t$ (that is, if $x_n$ ends up in leaf $t$ according to the tree topology). In leaf $t$ we predict $c_t$, so our prediction for $y_n$ is $\sum_{t \in T} I(x_n \in A_t)\, c_t$, and the tree loss can be written equivalently as

$$\sum_{t \in T} \sum_{n \in t} (y_n - c_t)^2 = \sum_{n=1}^N \left( y_n - \sum_{t \in T} I(x_n \in A_t)\, c_t \right)^2,$$

and the latter is exactly the least squares problem from (a) and (b), with $z_{nt} = I(x_n \in A_t)$ as the indicators and the $c_t$ playing the role of the coefficients $\beta$.

(e)

This can be done either by recognizing that this is logistic regression on indicators, or by observing again that

$$\hat{c}_t = \operatorname{argmin}_{c_t} \sum_{n \in t} \left( -y_n \log c_t - (1 - y_n) \log(1 - c_t) \right).$$

The first-order condition gives

$$0 = -\frac{1}{\hat{c}_t} \sum_{n \in t} y_n + \frac{1}{1 - \hat{c}_t} \sum_{n \in t} (1 - y_n) = -\frac{1}{\hat{c}_t} N_t \hat\pi_t + \frac{1}{1 - \hat{c}_t} N_t (1 - \hat\pi_t).$$

Multiplying through by $-\hat{c}_t (1 - \hat{c}_t) / N_t$ gives

$$0 = (1 - \hat{c}_t)\hat\pi_t - \hat{c}_t (1 - \hat\pi_t) = \hat\pi_t - \hat{c}_t,$$

from which the result follows.
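As a quick numerical check of this first-order condition (not required for the answer), the following sketch minimizes the per-leaf logistic loss directly; it assumes numpy and scipy are available, and any one-dimensional minimizer would do.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=50)          # binary responses in a single leaf

def leaf_loss(c):
    # Logistic (cross-entropy) loss for a constant prediction c in (0, 1).
    return np.sum(-y * np.log(c) - (1 - y) * np.log(1 - c))

res = minimize_scalar(leaf_loss, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())                     # the numerical minimizer matches the proportion
```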

4 Interpretability of regression trees

A commonly claimed advantage of regression trees is their interpretability: it is relatively easy to inspect why a regression tree made the prediction it did. But that does not mean you can safely use trees to do variable selection for causal inference.

Suppose that you have data that’s generated according to the following process:

$$z \sim N(0, 10), \qquad x_1 \mid z \sim N(z, 0.1), \qquad x_2 \mid z \sim N(z, 0.1), \qquad y \mid z = I(z > 0).$$

Suppose we don't observe $z$, but only observe a random set of $N$ draws $(x_{n1}, x_{n2}, y_n)$ from this process.

(a)

Suppose we train a regression tree on this data using $x_{n1}$ and $x_{n2}$ as predictors. What is the probability that the variable $x_1$ will be selected for the first split? (Hint: you can argue by symmetry without doing any calculations.)

(b)

Suppose you make the first split on $x_{n1}$. Do you expect subsequent splits on $x_{n2}$ to make large improvements in the fit? (You may argue intuitively; there is no need to do calculations.)

(c)

Suppose you have trained a tree in which $x_{n1}$ is the most important variable, as measured by the average increase in fit within nodes that split on $x_1$. Does this mean that $x_{n2}$, taken alone, is a poor predictor of $y$? Does this mean that $x_{n1}$ is causally related to $y$?

Solutions

(a)

In the data generating process, the stochastic relationship between $x_1$ and $y$ is exactly the same as that between $x_2$ and $y$, so by symmetry the probability of choosing either variable for the first split is $0.5$.

(b)

Since the variance of $z$ is much larger than the variance of $x_1 \mid z$, $x_1$ is nearly as good a predictor as $z$ itself, and $x_2$ can provide little extra information. Occasionally, when $z$ is very close to $0$, you might do better predicting $y = 1$ only when both $x_1$ and $x_2$ are positive, but this happens rarely. So you will get little predictive benefit from additionally splitting on $x_2$.

(c)

As we've seen in (a) and (b), the fact that $x_2$ adds little value once you've split on $x_1$ does not mean that $x_2$ is a poor predictor taken alone, nor does it mean that $x_1$ is causally related to $y$ (here $y$ is determined by the unobserved $z$, not by either regressor). As with L1 regularization, if several explanatory variables are all equally powerful, trees will select one of them and then discard the rest.
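To see this concretely, here is a simulation sketch of the data generating process above, assuming scikit-learn's DecisionTreeClassifier is available and treating 10 and 0.1 as variances; it is illustrative only, not part of the graded answer.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
N = 5000
z = rng.normal(0.0, np.sqrt(10.0), size=N)      # unobserved driver, variance 10
x1 = z + rng.normal(0.0, np.sqrt(0.1), size=N)  # noisy copy of z
x2 = z + rng.normal(0.0, np.sqrt(0.1), size=N)  # another noisy copy of z
y = (z > 0).astype(int)

X = np.column_stack([x1, x2])
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(tree.feature_importances_)                # typically nearly all importance on one variable

# Yet each variable on its own is an excellent predictor of y ...
print(np.mean((x1 > 0) == y), np.mean((x2 > 0) == y))
# ... and neither x1 nor x2 causes y; both merely track the unobserved z.
```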

5 Monotonic transformations (based on a problem from Erez Buchweitz)

Suppose we are finding the optimal split point for a node $t$ using variable $r_n$. For a split point $\rho$, define

$$\bar{y}_\rho^- := \frac{\sum_{n \in t} I(r_n < \rho)\, y_n}{\sum_{n \in t} I(r_n < \rho)}, \qquad \bar{y}_\rho^+ := \frac{\sum_{n \in t} I(r_n > \rho)\, y_n}{\sum_{n \in t} I(r_n > \rho)},$$

$$R(\rho) = \sum_{n \in t} \left( (y_n - \bar{y}_\rho^-)^2 I(r_n < \rho) + (y_n - \bar{y}_\rho^+)^2 I(r_n > \rho) \right).$$

Typically a whole set of $\rho$, lying between two adjacent datapoints $r^-$ and $r^+$, all produce the same value of $R(\rho)$. Suppose that, when this happens, we always take our split point to be the midpoint, $\hat\rho = \frac{1}{2}(r^+ + r^-)$.

Consider applying CART with the above splitting procedure to regression observations $(x_n, y_n)$, and to $(z_n, y_n)$, where each component of $z_n$ is a monotonic transformation of the corresponding component of $x_n$. That is, $z_{np} = \phi_p(x_{np})$ for each $p$, where each $\phi_p$ is either strictly increasing or strictly decreasing.

(a)

In general, will the training error be different for the two datasets? Why or why not?

(b)

In general, will the test error be different for the two datasets? Why or why not?

(c)

Construct a simple one-dimensional example in which the tree trained on one of the two datasets gives a good test-set error and the tree trained on the other does not.

Solutions

(a)

The training error will be identical. Since each transformation is either strictly increasing or strictly decreasing, it does not change the ordering of the regressors in any leaf, and so the optimal split points will occur between the same pairs of datapoints, irrespective of whether we split on $x$ or $z$. That is, if we split between $x_{np}$ and $x_{n'p}$ at leaf $t$, we will split between $z_{np}$ and $z_{n'p}$ at the same leaf.

(b)

As long as $\phi_p$ is non-linear, the test error may be different in general, because the actual split points will differ even though the splits fall between the same pairs of datapoints: midpoints in one space are not midpoints in the other. Formally, you might say that

$$\frac{1}{2}(x_{np} + x_{n'p}) = \frac{1}{2}\left( \phi_p^{-1}(\phi_p(x_{np})) + \phi_p^{-1}(\phi_p(x_{n'p})) \right) = \frac{1}{2}\left( \phi_p^{-1}(z_{np}) + \phi_p^{-1}(z_{n'p}) \right) \neq \phi_p^{-1}\left( \frac{1}{2}(z_{np} + z_{n'p}) \right) \text{ in general.}$$

(c)

Suppose that we have two points, $x_1 = -10$ and $x_2 = 10$, with $y = I(x > 0)$. A tree trained on $x$ splits at $\frac{1}{2}(-10 + 10) = 0$, which is the optimal point. But if we take $z = \exp(x)$, which is strictly increasing, then we split at $\frac{1}{2}(\exp(-10) + \exp(10)) \approx 11013$, which corresponds to $x = \log(11013) \approx 9.3$, a terrible split point.
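The same arithmetic in a short Python sketch, assuming numpy, using the midpoint split rule from the problem statement:

```python
import numpy as np

x = np.array([-10.0, 10.0])
z = np.exp(x)                         # strictly increasing transform of x

split_x = x.mean()                    # midpoint split on x: 0.0, the ideal cutoff
split_z = z.mean()                    # midpoint split on z: about 1.1e4
split_z_in_x_scale = np.log(split_z)  # about 9.3 in the original scale

print(split_x, split_z, split_z_in_x_scale)
# A test point with, say, x = 5 (so z = e^5 ~ 148) has y = 1, but it falls below
# the z-scale split, so the tree trained on z misclassifies it.
```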

6 The accuracy of bagging (based on Bühlmann and Yu (2002)) (254 only)

In this problem we explore some simple settings in which bagging does and does not work.

(a)

Suppose we observe $Y = (y_1, \ldots, y_N)$, which we use to fit the constant model $f = \beta_0$. We use least squares regression, so $\hat\beta_0 = \frac{1}{N}\sum_{n=1}^N y_n$, and so $\hat{f} = \frac{1}{N}\sum_{n=1}^N y_n$.

Let $Y^* = (y_1^*, \ldots, y_N^*)$ denote a bootstrap sample from $Y$, and write the ideal bagging estimator as

$$\hat{f}^* = \mathbb{E}_{Y^*}\left[ \frac{1}{N}\sum_{n=1}^N y_n^* \right].$$

Show that $\hat{f}^* = \hat{f}$. That is, bagging does not improve the estimator in this case. In fact, it makes it worse if $\hat{f}^*$ is estimated using a finite number of bootstrap samples, since it adds variance due to the bootstrap Monte Carlo sampling.

(b)

Next, suppose that we observe pairs $(x_n, y_n)$, where $x_n \sim N(\mu, 1)$ and $y_n \in \{0, 1\}$. Writing $X$ for the vector of observations, suppose that we wish to predict $y_n$ from $x_n$ using the following classifier:

$$\hat{f}(x; X) = I\left( x \geq \frac{1}{N}\sum_{n=1}^N x_n \right).$$

Let $\Phi(\cdot)$ denote the CDF of the standard normal distribution. Show that, for a fixed $x$,

$$\mathbb{E}_X\left[\hat{f}(x; X)\right] = \Phi\left(\sqrt{N}(x - \mu)\right), \qquad \mathrm{Var}_X\left(\hat{f}(x; X)\right) = \Phi\left(\sqrt{N}(x - \mu)\right)\left(1 - \Phi\left(\sqrt{N}(x - \mu)\right)\right).$$

(c)

In the setting of (b), let $X^*$ denote a bootstrap sample from $X$, and let $\bar{x} = \frac{1}{N}\sum_{n=1}^N x_n$. Using this, show that the bagged estimator for a fixed $x$ and large $N$ is

$$\hat{f}^*(x; X) \approx \Phi\left(\sqrt{N}(x - \bar{x})\right).$$

Note that the bagged estimator replaces the sharp threshold $\hat{f}(x; X) = I(x \geq \bar{x})$ with a soft threshold $\hat{f}^*(x; X) \approx \Phi(\sqrt{N}(x - \bar{x}))$.

Hint: you may use without proof the fact that, for large $N$, $\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^N x_n^* - \bar{x}\right)$ is approximately $N(0, 1)$ under the bootstrap randomness. (Note that in the bootstrap sampling distribution, $\bar{x}$ is fixed.)

(d)

Evaluating the results from (b) and (c) at $x = \mu$, show that, for large $N$,

$$\mathbb{E}_X\left[\hat{f}^*(x; X)\right] \approx \frac{1}{2}, \qquad \mathbb{E}_X\left[\hat{f}(x; X)\right] = \frac{1}{2}, \qquad \mathrm{Var}_X\left(\hat{f}^*(x; X)\right) \approx \frac{1}{12}, \qquad \mathrm{Var}_X\left(\hat{f}(x; X)\right) = \frac{1}{4}.$$

Conclude that, for $x = \mu$, the bagged estimator maintains the same expected prediction, but reduces the variance by a factor of 3.

Hint: use the fact that $\sqrt{N}(\mu - \bar{x})$ is a standard normal, and that $\Phi(Z)$ is uniform on $[0, 1]$ if $Z \sim N(0, 1)$.

Solutions

(a)

This was done in lecture. Briefly: under the bootstrap distribution each $y_n^*$ has expectation $\frac{1}{N}\sum_{m=1}^N y_m = \hat{f}$, so by linearity $\hat{f}^* = \mathbb{E}_{Y^*}\left[\frac{1}{N}\sum_{n=1}^N y_n^*\right] = \hat{f}$.

(b)

By normality, $\frac{1}{N}\sum_{n=1}^N x_n \sim N(\mu, N^{-1})$, so $\sqrt{N}\left(\frac{1}{N}\sum_{n=1}^N x_n - \mu\right) \sim N(0, 1)$. Let $z \sim N(0, 1)$. Then

$$\mathbb{E}_X\left[\hat{f}(x; X)\right] = P\left(x \geq \frac{1}{N}\sum_{n=1}^N x_n\right) = P\left(\sqrt{N}(x - \mu) \geq z\right) = \Phi\left(\sqrt{N}(x - \mu)\right).$$

Since $I\left(x \geq \frac{1}{N}\sum_{n=1}^N x_n\right)$ is a binary variable with expectation $p = \Phi(\sqrt{N}(x - \mu))$, its variance is $p(1 - p)$, and plugging in gives the desired answer.

(c)

The reasoning is as in (b) but using the CLT applied to the bootstrap distribution.

$$\hat{f}^*(x; X) = \mathbb{E}_{X^*}\left[\hat{f}(x; X^*)\right] = \mathbb{E}_{X^*}\left[I\left(x \geq \frac{1}{N}\sum_{n=1}^N x_n^*\right)\right] = \mathbb{E}_{X^*}\left[I\left(\sqrt{N}(x - \bar{x}) \geq \sqrt{N}\left(\frac{1}{N}\sum_{n=1}^N x_n^* - \bar{x}\right)\right)\right]$$

$$\approx \mathbb{E}_Z\left[I\left(\sqrt{N}(x - \bar{x}) \geq Z\right)\right] = \Phi\left(\sqrt{N}(x - \bar{x})\right), \qquad \text{where } Z \sim N(0, 1).$$

(d)

Evaluating the results from (b) and (c) at $x = \mu$, we get

$$\mathbb{E}_X\left[\hat{f}^*(\mu; X)\right] \approx \mathbb{E}_X\left[\Phi\left(\sqrt{N}(\mu - \bar{x})\right)\right] = \mathbb{E}\left[\Phi(Z)\right] = \int_0^1 u \, du = \frac{1}{2},$$

where $Z = \sqrt{N}(\mu - \bar{x}) \sim N(0, 1)$, so that $\Phi(Z)$ is uniform on $[0, 1]$,

and

$$\mathrm{Var}_X\left(\hat{f}^*(\mu; X)\right) \approx \mathbb{E}\left[\Phi(Z)^2\right] - \frac{1}{4} = \int_0^1 u^2 \, du - \frac{1}{4} = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$

For the unbagged estimator, (b) at $x = \mu$ gives expectation $\Phi(0) = \frac{1}{2}$ and variance $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$, so bagging leaves the expected prediction unchanged and reduces the variance by a factor of 3.
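A Monte Carlo sketch of this calculation, assuming numpy and scipy are available: it simulates many datasets and compares the variance of the hard classifier with that of the large-$N$ bagged classifier at $x = \mu$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, N, reps = 0.0, 200, 20000
# One sample mean xbar per simulated dataset X of size N.
xbar = rng.normal(mu, 1.0, size=(reps, N)).mean(axis=1)

hard = (mu >= xbar).astype(float)          # unbagged classifier f_hat(mu; X) = I(mu >= xbar)
soft = norm.cdf(np.sqrt(N) * (mu - xbar))  # large-N bagged classifier at x = mu

print(hard.mean(), hard.var())             # roughly 1/2 and 1/4
print(soft.mean(), soft.var())             # roughly 1/2 and 1/12 ~ 0.083
```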

7 Trees as approximations to linear functions (254 only)

Suppose you are trying to use a regression tree to represent the function $y = \mathbb{E}[y \mid x] := f(x) = x$ for $x \in \mathbb{R}$. That is, $y$ is observed without noise and the regression function is linear, but we're approximating the linear function with a decision tree. Assume that $x$ is uniformly distributed on $[0, 1]$.

(a)

Consider a connected set $A$ with endpoints $A_L$ and $A_R$ (i.e., $A = \{x : A_L \leq x < A_R\}$), and let $\Delta = A_R - A_L$ denote the width of $A$.

Suppose we predict $y_A$ for $x \in A$, where $y_A = \operatorname{argmin}_c \int_A (f(x) - c)^2 \, dx$ is the ideal least squares estimator on the set $A$. Show that $y_A = \frac{1}{2}(A_L + A_R)$, and that the integrated error is

$$\int_A (f(x) - y_A)^2 \, dx = \frac{\Delta^3}{12}.$$

(b)

Suppose we partition the unit interval into $K$ equally spaced intervals $A_k$, each with width $1/K$, with indicator functions $I(x \in A_k)$. Suppose we find the optimal least-squares prediction $y(x)$ for the regression of $y$ on these indicators. Using (a), find an expression for the expected squared error $\mathbb{E}_{x,y}\left[(y - y(x))^2\right]$ as a function of $K$.

(c)

In the setting of (b), for a fixed $\varepsilon > 0$, how large do we need to make $K$ in order to achieve $\mathbb{E}_{x,y}\left[(y - y(x))^2\right] \leq \varepsilon$?

(d)

Suppose we grow a regression tree with depth $D$, where each branch is split at the optimal point in the middle of the parent's interval. For a fixed $\varepsilon > 0$, how large do we need to make $D$ in order to achieve $\mathbb{E}_{x,y}\left[(y - y(x))^2\right] \leq \varepsilon$?

Note: It seems to me that computer scientists love to do accuracy order calculations like this!

Solutions

(a)

All we really need is the two integrals

$$\int_a^b x \, dx = \frac{1}{2}\left[x^2\right]_a^b = \frac{1}{2}(b^2 - a^2) \qquad \text{and} \qquad \int_a^b x^2 \, dx = \frac{1}{3}\left[x^3\right]_a^b = \frac{1}{3}(b^3 - a^3).$$

We know that $y_A$ satisfies the first-order condition

$$\frac{\partial}{\partial c} \int_A (f(x) - c)^2 \, dx = -2 \int_A (f(x) - c) \, dx = 0,$$

so $$y_A = \frac{\int_A f(x) \, dx}{\int_A dx} = \frac{\frac{1}{2}(A_R^2 - A_L^2)}{A_R - A_L} = \frac{1}{2}(A_R + A_L).$$

A simple way to do the second integral is to note that $\int_0^1 (x - 1/2)^2 \, dx = \frac{1}{12}$, so that $\int_0^1 \left( \Delta x + A_L - (\Delta/2 + A_L) \right)^2 dx = \frac{\Delta^2}{12}$. Making the substitution $z = \Delta x + A_L$ (so $dz = \Delta \, dx$) into the integral gives $\int_A (z - y_A)^2 \, dz = \frac{\Delta^3}{12}$, the desired result.

(b)

$$\mathbb{E}_{x,y}\left[(y - y(x))^2\right] = \mathbb{E}_x\left[ \sum_{k=1}^K I\left(\tfrac{k-1}{K} < x < \tfrac{k}{K}\right)\left(x - y(x)\right)^2 \right] = \sum_{k=1}^K \int_{(k-1)/K}^{k/K} (x - y_k)^2 \, dx = K \cdot \frac{1}{12 K^3} = \frac{1}{12 K^2},$$

where we used (a) with $\Delta = 1/K$ in the second-to-last equality.

(c)

By (b) we require

$$\frac{1}{12 K^2} \leq \varepsilon \quad \Leftrightarrow \quad K \geq \sqrt{\frac{1}{12 \varepsilon}}.$$

Since $K$ must be an integer, we can take $K = \left\lceil \sqrt{\frac{1}{12 \varepsilon}} \,\right\rceil$.

(d)

The optimal split point is always in the middle of the interval, so if we have a tree of depth $D$ then we have a partition into $K = 2^D$ regions. (Here I take a depth-one tree to be a single split, but the question is ambiguous, and taking a depth-one tree to be a single node is also fine, in which case you get $K = 2^{D-1}$.)

By (c) we then require

$$2^D = \exp(D \log 2) \geq \sqrt{\frac{1}{12 \varepsilon}} \quad \Leftrightarrow \quad D \geq \frac{\log \sqrt{\frac{1}{12 \varepsilon}}}{\log 2}.$$

Again, $D$ must be an integer, so we can take $D = \left\lceil \frac{\log \sqrt{\frac{1}{12 \varepsilon}}}{\log 2} \right\rceil$.
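A short numerical sketch, assuming numpy, that checks the $\frac{1}{12 K^2}$ formula on a grid and computes the $K$ and $D$ needed for a particular $\varepsilon$ (here $\varepsilon = 10^{-4}$, a made-up value for illustration):

```python
import numpy as np

def piecewise_mean_mse(K, n_grid=1_000_000):
    # Approximate E[(y - y(x))^2] for f(x) = x on [0, 1] split into K equal intervals.
    x = (np.arange(n_grid) + 0.5) / n_grid   # fine grid of x values on [0, 1]
    k = np.floor(K * x)                      # which interval each x falls into
    y_hat = (k + 0.5) / K                    # interval midpoint = within-interval mean of f
    return np.mean((x - y_hat) ** 2)

for K in (2, 4, 8):
    print(K, piecewise_mean_mse(K), 1 / (12 * K**2))    # the two values agree

eps = 1e-4
K_needed = int(np.ceil(np.sqrt(1 / (12 * eps))))        # K >= sqrt(1/(12 eps)) ~ 28.9 -> 29
D_needed = int(np.ceil(0.5 * np.log2(1 / (12 * eps))))  # depth with K = 2^D leaves -> 5
print(K_needed, D_needed)
```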

8 Bibliography

Bühlmann, Peter, and Bin Yu. 2002. "Analyzing bagging." The Annals of Statistics 30 (4): 927–61. https://doi.org/10.1214/aos/1031689014.
Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.