The kernel trick

Author

Ryan Giordano

Goals

  • Introduce the “kernel trick” for linear regression as a way of efficiently dealing with very high–dimensional regressors
  • Introduce and motivate the polynomial kernel

Reading

I will be taking readings from Schölkopf and Smola (2002) and Rasmussen and Williams (2006), both of which are available digitally from the UC Berkeley library website. I’ll also use Hastie, Tibshirani, and Friedman (2009) Chapter 12.

Monomials: a high–dimensional regression problem

We’ll begin our discussion of kernels with a motivating regression problem. The only purpose of this problem is to motivate regression on higher–order polynomials.

Suppose we have a regressor \(\boldsymbol{r}\in [0,1)^R\), which we want to use to predict a real–valued response \(y\in \mathbb{R}\) using ridge regression. (For today we’ll keep the ridge parameter \(\lambda\) fixed rather than estimating it via cross–validation.)

As usual, we want to admit an expressive model by transforming \(\boldsymbol{r}\). Since each component of \(\boldsymbol{r}\) lies in \([0,1)\), we can form an expressive set of features by dividing the unit interval into \(M\) parts, for \(m=1,...,M\), and forming the indicator features

\[ z_{rm} = \mathrm{I}\left(\boldsymbol{r}_{r} \in \Delta_m\right) \textrm{ for }r=1,\ldots,R \textrm{ and }m=1,\ldots,M, \textrm{ where }\Delta_m = [(m-1) / M, m / M). \] There are \(D = RM\) such indicators. Stacking them vertically into a single vector, we have \(\boldsymbol{z}\in \mathbb{R}^{D}\).
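As a concrete, minimal sketch of this feature map (using numpy; the function name is mine), one might write:

```python
import numpy as np

def indicator_features(r, M):
    """Map r in [0, 1)^R to the D = R * M indicator features z_{rm}.

    z_{rm} = 1 if component r falls in the bin [(m - 1) / M, m / M), else 0.
    """
    r = np.asarray(r)
    R = len(r)
    bins = np.floor(r * M).astype(int)       # bin index in 0, ..., M - 1
    z = np.zeros(R * M)
    z[np.arange(R) * M + bins] = 1.0         # one indicator per component of r
    return z

# R = 2 regressors, M = 4 bins each, so D = 8 indicator features.
print(indicator_features([0.1, 0.7], M=4))   # [1. 0. 0. 0. 0. 0. 1. 0.]
```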

Of course, the \(z_{rm}\) only check whether a single component (given by \(r\)) is in a region (given by \(m\)). To check whether a pair of components are in a region, you can consider the pairwise products \(z_{rm} z_{r'm'}\). Similarly, trios of components are given by third–order monomials, and so on. You can represent the precise bin of the point \(\boldsymbol{r}\) with the product of all \(R\) indicators \(\prod_{r=1}^R z_{rm_{r}}\), where \(m_r\) picks out a bin for regressor \(r\).

Thinking this way, suppose we decide to regress on all possible \(K\)–th order monomials, that is, on the non–linear feature maps \(z_{r_1 m_1} z_{r_2 m_2} \ldots z_{r_K m_K}\), for every possible \(r_1, \ldots, r_K\) and \(m_1, \ldots, m_K\). There are \(P = D^K = (RM)^K\) such terms (note that there will be repeats). That’s a lot — and it’s not hard to imagine that \(P \gg N\), so we have many more regressors than data points.
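To get a feel for how quickly this count grows, here is a rough sketch that enumerates the monomials explicitly with itertools (the function name and the small example values are mine):

```python
import itertools
import numpy as np

def monomial_features(z, K):
    """All K-th order monomials z[s_1] * ... * z[s_K], one entry per index vector s.

    The result has D**K entries (with repeats), where D = len(z).
    """
    D = len(z)
    return np.array([np.prod([z[s] for s in idx])
                     for idx in itertools.product(range(D), repeat=K)])

# The D = 8 indicator features from above, for a point whose two components fall in bins 1 and 3.
z = np.array([1., 0., 0., 0., 0., 0., 1., 0.])
x = monomial_features(z, K=3)
print(x.shape)        # (512,) = D ** K features from only D = 8 indicators
```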

Let’s examine regressing on this transformation, starting with ridge regression.

Ridge regression

\[ \newcommand{\Um}{\underset{NP}{\boldsymbol{U}}} \newcommand{\Vm}{\underset{PN}{\boldsymbol{V}}} \]

Stepping back to the general case, suppose we have a design matrix, \(\boldsymbol{X}\), which is \(N \times P\), with \(P > N\). Recall that the ridge estimator is given by

\[ \hat{\boldsymbol{\beta}}_{L2} := (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda {\boldsymbol{I}_{P}})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}. \]

Even though \(\boldsymbol{X}\) is only rank \(N < P\), the inverse is well defined thanks to the ridge correction. However, \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is a \(P \times P\) matrix: each of its entries is an \(N\)–dimensional dot product, so forming it costs order \(P^2 N\), and inverting it costs order \(P^3\). Both are very expensive when the feature space is large.
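Here is a minimal numpy sketch of this direct computation on simulated data (the dimensions and names are illustrative); the \(P \times P\) solve is the bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 50, 2000, 0.1              # many more features than observations
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

# Direct ridge estimate: form and solve the P x P system, at cost order P^2 N + P^3.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)
print(beta_hat.shape)                  # (2000,)
```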

Let’s learn a trick to avoid this heavy computation. This trick will in fact allow us to do regression tractably in infinite–dimensional feature spaces, and will lead to a different interpretation of linear regression that can help explain how interpolating estimators might still generalize.

I’ll use a linear algebra trick that Wikipedia calls the “push–through identity” (Wikipedia contributors (2024)). Suppose \(\Um\) is \(N\times P\) and \(\Vm\) is \(P \times N\), and both \(\Um \Vm + {\boldsymbol{I}_{N}}\) and \(\Vm\Um + {\boldsymbol{I}_{P}}\) are invertible. Then

\[ (\Um \Vm + {\boldsymbol{I}_{N}}) \Um = \Um (\Vm \Um + {\boldsymbol{I}_{P}}) \quad\Leftrightarrow\quad \Um (\Vm \Um + {\boldsymbol{I}_{P}})^{-1} = (\Um \Vm + {\boldsymbol{I}_{N}})^{-1} \Um. \]

The proof is simple, and the identity allows you to “push” a non–square matrix inside an inverse and have it pop out on the other side, changing the dimension of the inverse. (It can be used to help derive the Woodbury formula, which is why it appears in Wikipedia contributors (2024).)
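A quick numerical sanity check of the identity on random matrices (a sketch, not a proof; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 12
U = rng.normal(size=(N, P))
V = rng.normal(size=(P, N))

lhs = U @ np.linalg.inv(V @ U + np.eye(P))    # U (V U + I_P)^{-1}
rhs = np.linalg.inv(U @ V + np.eye(N)) @ U    # (U V + I_N)^{-1} U
print(np.allclose(lhs, rhs))                  # True, up to floating point error
```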

Applying the push–through identity to the ridge formula (see homework),

\[ \hat{\boldsymbol{\beta}}_{L2} = \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{X}^\intercal+ \lambda {\boldsymbol{I}_{N}})^{-1} \boldsymbol{Y}. \]

Note that we have replaced a \(P \times P\) inverse with an \(N \times N\) inverse, and we have replaced \(\boldsymbol{X}^\intercal\boldsymbol{X}\) with \(\boldsymbol{X}\boldsymbol{X}^\intercal\). Interestingly, unlike \(\boldsymbol{X}^\intercal\boldsymbol{X}\), the entries of \(\boldsymbol{X}\boldsymbol{X}^\intercal\) are only pairwise dot products of feature vectors:

\[ (\boldsymbol{X}\boldsymbol{X}^\intercal)_{n n'} = \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'}. \]

Naively, there are \(N^2\) entries in \(\boldsymbol{X}\boldsymbol{X}^\intercal\), and each inner product is \(P\)–dimensional, so the cost of computing \(\boldsymbol{X}\boldsymbol{X}^\intercal\) is order \(P N^2\). (In fact, we know that \(\boldsymbol{X}\boldsymbol{X}^\intercal\) is symmetric, so we only need to compute \(N (N + 1) / 2\) inner products, but this is still order \(N^2\), so I’ll ignore this detail.) Contrast this with computing \(\boldsymbol{X}^\intercal\boldsymbol{X}\), which takes order \(P^2 N\) computations. Similarly, the inverse of \(\boldsymbol{X}\boldsymbol{X}^\intercal+ \lambda {\boldsymbol{I}_{N}}\) is only size \(N \times N\), so we pay a cost of roughly \(N^3\) to perform the inverse, instead of the cost of \(P^3\) to invert \(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda {\boldsymbol{I}_{P}}\).
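The following numpy sketch (with illustrative dimensions) confirms that the two formulas agree, while the second one only ever solves an \(N \times N\) system:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, lam = 50, 2000, 0.1
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

# P x P solve: order P^2 N to form X^T X, plus P^3 to solve.
beta_big = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)

# N x N solve: order P N^2 to form X X^T, plus N^3 to solve.
beta_small = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), Y)

print(np.allclose(beta_big, beta_small))      # True: the two formulas agree
```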

\(N^3\) is still a lot! But it’s much better than \(P^3\). A further gain, with much wider implications, is an efficient way to compute the \(P\)–dimensional inner products themselves, reducing the whole problem to order \(N^3\). This trick will eventually allow us to solve ML problems with \(P=\infty\) — an infinite number of regressors — while only paying an \(N^3\) order computational cost.

The polynomial kernel

Let’s go back to assuming that each entry of \(\boldsymbol{x}\) is a \(K\)–th order monomial of some feature vector \(\boldsymbol{z}\), so that for a vector of indices \(\boldsymbol{s}\in \{1, \ldots , D\}^{K}\), \[ x_{\boldsymbol{s}} = \prod_{k=1}^K z_{\boldsymbol{s}_k} = z_{s_1} z_{s_2} \ldots z_{s_K}. \] There are \(D^K\) such possible orderings, so stacking up all the monomials into a vector \(\boldsymbol{x}\), we get that \(\boldsymbol{x}\in \mathbb{R}^{D^K}\).

Let \(S\) be the \(D^K\)–sized set of all possible orderings, i.e., all possible values of \(\boldsymbol{s}\). Then we can write

\[ \begin{aligned} \boldsymbol{x}_{n}^\intercal\boldsymbol{x}_{m} ={}& \sum_{\boldsymbol{s}\in S} \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ns_2} \ldots \boldsymbol{z}_{ns_K} \times \boldsymbol{z}_{ms_1} \boldsymbol{z}_{ms_2} \ldots \boldsymbol{z}_{ms_K} \\={}& \sum_{s_1 =1}^D \sum_{s_2 = 1}^D \ldots \sum_{s_K=1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ns_2} \ldots \boldsymbol{z}_{ns_K} \times \boldsymbol{z}_{ms_1} \boldsymbol{z}_{ms_2} \ldots \boldsymbol{z}_{ms_K} \\={}& \sum_{s_1 =1}^D \sum_{s_2 = 1}^D \ldots \sum_{s_K=1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ms_1} \times \boldsymbol{z}_{ns_2} \boldsymbol{z}_{ms_2} \times \ldots \times \boldsymbol{z}_{ns_K} \boldsymbol{z}_{ms_K} \\={}& \left(\sum_{s_1 =1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ms_1} \right) \left(\sum_{s_2 =1}^D \boldsymbol{z}_{ns_2} \boldsymbol{z}_{ms_2} \right) \ldots \left(\sum_{s_K =1}^D \boldsymbol{z}_{ns_K} \boldsymbol{z}_{ms_K} \right) \\={}& \left(\boldsymbol{z}_n^\intercal\boldsymbol{z}_m \right)^K. \end{aligned} \]

This simple re-write replaced a sum of \(P = D^K\) terms with the \(K\)–th power of a single sum of \(D\) terms. That’s a big computational savings when \(D\) is large! Let us define the “kernel function” \[ \mathcal{K}(\boldsymbol{z}, \boldsymbol{z}') = \left(\boldsymbol{z}^\intercal\boldsymbol{z}' \right)^K. \]
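As a sanity check, the following sketch (with made-up dimensions) verifies numerically that the explicit \(D^K\)–dimensional inner product matches the kernel evaluation:

```python
import itertools
import numpy as np

def monomial_features(z, K):
    """Explicit K-th order monomial feature map: one entry per index vector s."""
    return np.array([np.prod([z[s] for s in idx])
                     for idx in itertools.product(range(len(z)), repeat=K)])

rng = np.random.default_rng(0)
D, K = 6, 3
z_n, z_m = rng.normal(size=D), rng.normal(size=D)

explicit = monomial_features(z_n, K) @ monomial_features(z_m, K)  # a D^K = 216 term sum
kernel = (z_n @ z_m) ** K                                         # a D = 6 term sum, cubed
print(np.isclose(explicit, kernel))                               # True
```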

We can then write

\[ (\boldsymbol{X}\boldsymbol{X}^\intercal)_{nm} = \mathcal{K}(\boldsymbol{z}_n, \boldsymbol{z}_m), \]

so that evaluating \(\boldsymbol{X}\boldsymbol{X}^\intercal\) costs only order \(D N^2\) rather than order \(P N^2 = D^K N^2\). This is a massive savings when \(P\) is large.

Note that we don’t even need to compute \(\hat{\boldsymbol{\beta}}\) explicitly, nor form the \(P\)–dimensional features, when making a prediction at a new datapoint, since

\[ \boldsymbol{x}_{\mathrm{new}}^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{X}^\intercal+ \lambda {\boldsymbol{I}_{N}})^{-1} \boldsymbol{Y}, \]

and the \(n\)–th entry of \(\boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{X}^\intercal\) is given by \(\boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{x}_n = \mathcal{K}(\boldsymbol{z}_{\mathrm{new}}, \boldsymbol{z}_n)\).
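Putting the pieces together, here is a minimal sketch of ridge fitting and prediction using only kernel evaluations, on simulated data (all names and values are illustrative):

```python
import numpy as np

def poly_kernel(Z1, Z2, K):
    """Gram matrix of the polynomial kernel (z^T z')^K between the rows of Z1 and Z2."""
    return (Z1 @ Z2.T) ** K

rng = np.random.default_rng(0)
N, D, K, lam = 50, 6, 3, 0.1
Z = rng.normal(size=(N, D))           # training inputs (the original features z_n)
Y = rng.normal(size=N)
Z_new = rng.normal(size=(3, D))       # new points to predict at

# Fit: a single N x N solve, never touching the D^K-dimensional monomial features.
alpha = np.linalg.solve(poly_kernel(Z, Z, K) + lam * np.eye(N), Y)

# Predict: each prediction is a weighted sum of kernel evaluations against the training inputs.
y_pred = poly_kernel(Z_new, Z, K) @ alpha
print(y_pred.shape)                   # (3,)
```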

The kernel trick

In these examples, we went from a particular featurization to a kernel function. However, upon inspection, there are some funny things about this feature space. You might ask whether you can take an arbitrary featurization and turn it into a computationally efficient kernel. In general, the answer appears to be no.

But that may be the wrong question. One might instead ask: why start with a featurization at all? Why not start with a computationally efficient kernel, and simply replace the dot product with that kernel? What kinds of kernel functions can you use? What is the implicit featurization for a given kernel? And are there algorithms other than ridge regression that admit this trick? These are the questions we’ll answer in the coming unit.

References

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.
Rasmussen, C., and C. Williams. 2006. Gaussian Processes for Machine Learning. MIT Press.
Schölkopf, B., and A. Smola. 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Wikipedia contributors. 2024. “Woodbury Matrix Identity — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Woodbury_matrix_identity&oldid=1253894787.