Making new regressors from old

Author

Ryan Giordano

Goals

  • Motivate making new regressors from old in an ML context
  • Introduce some useful classes of non–linear transformations
    • Polynomial series
    • One–hot encodings
    • Splines

Reading

These lecture notes are a supplement for

  • Gareth et al. (2021) Chapters 3.3, 7.1–7.5
  • Hastie, Tibshirani, and Friedman (2009) Chapters 5.1–5.2 and appendix

New regressors from old

Recall that we are trying to approximate

\[ \mathbb{E}_{\,}\left[y| \boldsymbol{x}\right] \approx \boldsymbol{\beta}^\intercal\boldsymbol{x}. \]

In general, the left–hand side \(\mathbb{E}_{\,}\left[y| \boldsymbol{x}\right]\) is not linear in the regressors, so linear regression will not be able to characterize \(\mathbb{E}_{\,}\left[y| \boldsymbol{x}\right]\). How close can it get? The more regressors you include, the more expressive \(\boldsymbol{\beta}^\intercal\boldsymbol{x}\) is. But typically you only have a finite set of regressors. Does this upper bound the complexity of the functions you can fit?

It turns out that it does not, since we can take non–linear transformations of the regressors to make new regressors.

Taylor series

Suppose for simplicity that \(x\) is a scalar, and that \(\mathbb{E}_{\,}\left[y| x\right]\) is an analytic function, meaning it admits a Taylor series expansion:

\[ \mathbb{E}_{\,}\left[y| x\right] = \beta^{*}_0 + \beta^{*}_1 x+ \frac{1}{2} \beta^{*}_2 x^2 + \ldots + \frac{1}{k!} \beta^{*}_k x^k + \ldots = \sum_{k=0}^\infty \frac{1}{k!} \beta^{*}_k x^k. \]

Effectively, the Taylor series means that the true conditional expectation is in fact a linear function, but of an infinite number of regressors. We can imagine truncating at some \(K\) and running the regression using the features \(\boldsymbol{z}_n = (1, x, \frac{1}{2} x^2, \ldots, \frac{1}{K!} x^K)\). The error we commit is \(\sum_{k=K+1}^\infty \frac{1}{k!} \beta^{*}_k x^k\), which goes to zero as \(K \rightarrow \infty\).

The key is that \(\boldsymbol{z}_n\) is a non–linear function of \(\boldsymbol{x}_n\). In general, linear regression on non–linear functions of the regressors gives more expressive function classes.

Note that you would have numerical problems if you actually used \(\boldsymbol{z}_n\), since the matrix \(\boldsymbol{Z}^\intercal\boldsymbol{Z}\) will typically be nearly numerically singular. Instead, you should use an orthogonal system of polynomials (like Legendre polynomials). But typically you wouldn't even do this; in practice you'd use splines (discussed below).
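
To see the conditioning issue concretely, here is a small sketch using only numpy (the data-generating function and degree are made up for illustration), comparing the condition number of a raw polynomial design matrix with that of a Legendre design matrix of the same degrees:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 10
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)  # made-up nonlinear "truth" plus noise

# Raw polynomial features 1, x, x^2, ..., x^K.
Z_raw = np.vander(x, K + 1, increasing=True)

# Legendre polynomial features spanning the same degrees.
Z_leg = np.polynomial.legendre.legvander(x, K)

print(np.linalg.cond(Z_raw))   # typically much larger
print(np.linalg.cond(Z_leg))   # typically much smaller

# Either basis spans the same functions, so the fit is the same up to numerics.
beta_hat, *_ = np.linalg.lstsq(Z_leg, y, rcond=None)
y_hat = Z_leg @ beta_hat
```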

Taylor series and interactions

Suppose you have two regressors, \(x\) and \(z\). Compare the two regressions:

\[ \begin{aligned} f_1(x, z| \beta) :={}& \beta_0 + \beta_x x+ \beta_z z\quad \textrm{versus} \\ f_2(x, z| \gamma) :={}& \gamma_0 + \gamma_x x+ \gamma_z z+ \gamma_{xx} x^2 + \gamma_{zz} z^2 + \gamma_{xz} xz. \end{aligned} \]

The second regression contains an interaction between \(x\) and \(z\). In \(f_1\), the effect of an increase in \(x\) on the prediction is the same no matter what \(z\) is. In \(f_2\), the effect of an increase in \(x\) depends on the value of \(z\): there is an interaction between \(x\) and \(z\) in the model.

In fact, \(f_2\) has precisely the form of a second–order Taylor series expansion in the regressors of \(f_1\). A third–order Taylor series expansion models third–order interactions, and so on.
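
As a sketch of how this expansion is built in practice (assuming a recent scikit-learn, and a made-up data-generating process), PolynomialFeatures with degree=2 generates exactly the regressors of \(f_2\), including the interaction term:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))                              # columns are x and z
y = 1.0 + X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(100)   # truth contains an interaction

# degree=2 produces 1, x, z, x^2, x*z, z^2: exactly the regressors of f_2.
quad = PolynomialFeatures(degree=2, include_bias=True)
Z = quad.fit_transform(X)
print(quad.get_feature_names_out(["x", "z"]))

# fit_intercept=False because the expansion already contains the constant term.
fit = LinearRegression(fit_intercept=False).fit(Z, y)
print(fit.coef_)  # the coefficient on the x*z column should be near 1
```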

Indicator functions

You have probably seen how to turn categorical variables into regressors using one–hot encodings. A similar trick can be done with continuous random variables. Again, take the regressor \(x\) to be a scalar for simplicity. We can define a categorical variable “Is \(x\) positive?” If we turn this into a one–hot encoding, we get

\[ z_n = \mathrm{I}\left(x> 0\right) = \begin{cases} 1 & \textrm{ if }x> 0 \\ 0 & \textrm{ otherwise}. \end{cases} \]

If we regress \(y_n \sim 1 + z_n\), our prediction \(\hat{y}_\mathrm{new}\) will simply be the average of \(y_n\) among the positive or the negative training instances, according to whether \(x_\mathrm{new}\) is positive or negative.
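
A minimal numerical check of this claim, using plain numpy and a made-up data-generating process:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.where(x > 0, 2.0, -1.0) + 0.3 * rng.standard_normal(500)

z = (x > 0).astype(float)                   # the indicator regressor I(x > 0)
Z = np.column_stack([np.ones_like(z), z])   # design matrix for y ~ 1 + z

beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
# The prediction for a non-positive x_new is beta_0; for a positive x_new it is beta_0 + beta_1.
print(beta_hat[0], y[x <= 0].mean())                # should match
print(beta_hat[0] + beta_hat[1], y[x > 0].mean())   # should match
```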

Similarly, for any set of knots \(\zeta_0 < \zeta_1 < \ldots < \zeta_K\) partitioning a range of the \(x\)–axis into intervals, we can construct the indicators

\[ \boldsymbol{z}^0_n = \begin{pmatrix} \mathrm{I}\left(\zeta_0 \le x< \zeta_1\right) \\ \mathrm{I}\left(\zeta_1 \le x< \zeta_2\right) \\ \vdots\\ \mathrm{I}\left(\zeta_k \le x< \zeta_{k+1}\right) \\ \vdots\\ \mathrm{I}\left(\zeta_{K-1} \le x< \zeta_{K}\right) \\ \end{pmatrix}. \]

Regressing on \(\boldsymbol{z}^0_n\) so constructed produces a piecewise constant fit: within each interval, the regression returns the average of the training responses.
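
Here is a sketch of this construction in numpy (the knots and the data-generating process are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(300)

knots = np.linspace(0, 1, 6)            # zeta_0, ..., zeta_5: five intervals
# np.digitize assigns each x to its interval; one-hot encode that assignment.
bucket = np.digitize(x, knots[1:-1])    # interval index in 0, ..., 4
Z0 = np.eye(len(knots) - 1)[bucket]     # the indicator regressors z^0_n

beta_hat, *_ = np.linalg.lstsq(Z0, y, rcond=None)
# Each coefficient is just the average response within its interval.
print(beta_hat)
print([y[bucket == k].mean() for k in range(len(knots) - 1)])
```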

We will see later that tree–based methods (for example, random forests) are exactly of this form, but in multiple dimensions, and where the partitions are encoded by decision trees and chosen optimally in some sense. However, keeping the trees fixed, tree–based methods are simply regressing on indicators in this way!

Splines

One problem with indicator functions is that the resulting function approximations are “bumpy”: they are flat (zero derivative) within each interval and jump discontinuously at the knots.

However, the indicator functions can be modified in a simple way to produce piecewise linear functions. In addition to the regressors \(\boldsymbol{z}^0_n\) defined above, define

\[ \boldsymbol{z}^1_n = \begin{pmatrix} \mathrm{I}\left(\zeta_0 \le x< \zeta_1\right) (x- \zeta_0) \\ \vdots\\ \mathrm{I}\left(\zeta_k \le x< \zeta_{k+1}\right)(x- \zeta_k) \\ \vdots\\ \mathrm{I}\left(\zeta_{K-1} \le x< \zeta_{K}\right)(x- \zeta_{K-1}) \\ \end{pmatrix}. \]

Regressing \(y_n \sim \boldsymbol{z}^0_n + \boldsymbol{z}^1_n\) produces a piecewise linear fit. Effectively, it produces a separate first–order Taylor series approximation within each segment. The idea can be generalized using regressors of the form

\[ \boldsymbol{z}^p_n = \begin{pmatrix} \mathrm{I}\left(\zeta_0 \le x< \zeta_1\right) (x- \zeta_0)^p \\ \vdots\\ \mathrm{I}\left(\zeta_k \le x< \zeta_{k+1}\right)(x- \zeta_k)^p \\ \vdots\\ \mathrm{I}\left(\zeta_{K-1} \le x< \zeta_{K}\right)(x- \zeta_{K-1})^p \\ \end{pmatrix}, \]

regressing on which (together with the lower–order regressors \(\boldsymbol{z}^0_n, \ldots, \boldsymbol{z}^{p-1}_n\)) produces a \(p\)–th order polynomial within each bucket.
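
Here is a sketch of these regressors as a small helper (the function name make_piecewise_basis is made up for this example); with degree=1 it reproduces the piecewise linear fit described above:

```python
import numpy as np

def make_piecewise_basis(x, knots, degree):
    """Stack the regressors z^0_n, ..., z^degree_n for a given knot sequence."""
    bucket = np.digitize(x, knots[1:-1])          # interval index for each observation
    indicators = np.eye(len(knots) - 1)[bucket]   # z^0
    left_edge = knots[bucket]                     # zeta_k for each observation
    cols = [indicators * (x - left_edge)[:, None] ** p for p in range(degree + 1)]
    return np.hstack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(300)

knots = np.linspace(0, 1, 6)
Z = make_piecewise_basis(x, knots, degree=1)   # z^0 and z^1: piecewise linear
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ beta_hat                           # a separate line within each interval
```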

Using splines for continuous functions

One problem with the preceding construction is that the fitted functions are still (in general) discontinuous. For example, consider the piecewise linear regression function

\[ f_2(x) = \boldsymbol{\beta}^\intercal\boldsymbol{z}^0_n + \boldsymbol{\gamma}^\intercal\boldsymbol{z}^1_n. \]

Within each interval, \(f_2(x)\) is linear, and so continuous. However, its left and right limits at a knot may differ. For example,

\[ \begin{aligned} f_2(\zeta_1^- | \boldsymbol{\beta}) ={}& \beta_1 + \gamma_1 (x- \zeta_0) = \beta_1 + \gamma_1 (\zeta_1 - \zeta_0) & \textrm{ just to the left of $\zeta_1$} \\ f_2(\zeta_1^+ | \boldsymbol{\beta}) ={}& \beta_2 + \gamma_2 (x- \zeta_1) = \beta_2. & \textrm{ just to the right of $\zeta_1$} \\ \end{aligned} \]

We can thus enforce continuity by requiring

\[ \begin{aligned} \beta_1 + \gamma_1 (\zeta_1 - \zeta_0) ={}& \beta_2 \\ \beta_2 + \gamma_2 (\zeta_2 - \zeta_1) ={}& \beta_3 \\ \vdots& \\ \beta_{K-1} + \gamma_{K-1} (\zeta_{K-1} - \zeta_{K-2}) ={}& \beta_K \\ \end{aligned} \]

Note that this amounts to \(K-1\) linear constraints for \(K\) intervals, one at each interior knot, which can, in general, be written in the form

\[ A \begin{pmatrix} \boldsymbol{\beta}\\ \boldsymbol{\gamma} \end{pmatrix} = 0. \]

This linear restriction reduces the free parameter space from \(2K\) dimensions to \(2K - (K - 1) = K + 1\), and can be enforced by linearly reparameterizing the regressors. Note that the first derivative of a piecewise linear approximation cannot be made continuous at the knots (without collapsing to a single line). However, the first derivatives of a piecewise quadratic approximation can be: regressing on \(\boldsymbol{z}^0_n\), \(\boldsymbol{z}^1_n\), and \(\boldsymbol{z}^2_n\) subject to linear constraints gives a function with continuous first derivatives but (in general) discontinuous second derivatives, and so on.
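
One explicit reparameterization that builds in these continuity constraints is the truncated power basis (Hastie, Tibshirani, and Friedman (2009) 5.2): for a continuous piecewise linear fit, regress on \(1\), \(x\), and the hinge functions \((x - \zeta_k)_+\) at the interior knots. A sketch in numpy (knots and data made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=300))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(300)

interior_knots = np.array([0.2, 0.4, 0.6, 0.8])   # zeta_1, ..., zeta_{K-1}
# Truncated power basis for a continuous piecewise linear fit: 1, x, (x - zeta_k)_+.
Z = np.column_stack(
    [np.ones_like(x), x] + [np.maximum(x - zk, 0.0) for zk in interior_knots]
)
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_hat = Z @ beta_hat   # continuous piecewise linear by construction, with K + 1 free parameters
```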

Standard nomenclature describes an “order–\(M\)” spline as consisting of piecewise degree–\((M-1)\) polynomials with \(M-2\) continuous derivatives. For example, a continuous piecewise quadratic spline consists of degree–\(2\) polynomials, has \(1\) continuous derivative, and is an order–\(3\) spline (Hastie, Tibshirani, and Friedman (2009) 5.2).

In practice, we do not select a basis and then analytically derive linear constraints. Instead, we can use a recursive construction known as B–splines, which gives an explicit basis in which the continuity constraints are automatically enforced. See the appendix to Hastie, Tibshirani, and Friedman (2009) Chapter 5.
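
As a sketch of fitting a B–spline basis in practice (assuming scikit-learn 1.0 or later, whose SplineTransformer constructs the B–spline basis internally; the knot count and data are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.2 * rng.standard_normal(300)

# degree=3 gives cubic pieces (an order-4 spline); n_knots controls how many
# knots are placed over the range of x.
model = make_pipeline(
    SplineTransformer(degree=3, n_knots=8),
    LinearRegression(),
)
model.fit(x, y)

x_new = np.linspace(0, 1, 50).reshape(-1, 1)
y_new = model.predict(x_new)
```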

Extrapolation

Note that there are different things you can do beyond the boundary of your partition. Two are common:

  • Keep the order of the spline the same (this can result in crazy behavior)
  • Restrict the order beyond the boundary to be linear or constant

Usually the latter does better, because realistic functions are not in fact unbounded polynomials. However, in a sense, the training data cannot tell you what goes on beyond its own extrema.
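
This choice shows up directly in software. As a sketch (again assuming scikit-learn's SplineTransformer, whose extrapolation argument includes "continue", "linear", and "constant"), compare predictions at points beyond the training range:

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.2 * rng.standard_normal(300)

x_outside = np.array([[-0.5], [1.5]])   # points beyond the training range
for extrapolation in ["continue", "linear", "constant"]:
    model = make_pipeline(
        SplineTransformer(degree=3, n_knots=8, extrapolation=extrapolation),
        LinearRegression(),
    )
    model.fit(x, y)
    # "continue" keeps the polynomial order, while "linear" and "constant" restrict it.
    print(extrapolation, model.predict(x_outside))
```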

Bibliography

Gareth, J., W. Daniela, H. Trevor, and T. Robert. 2021. An Introduction to Statistical Learning: With Applications in Python. Springer.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.