The kernel trick
Goals
- Introduce the “kernel trick” for linear regression as a way of efficiently dealing with very high–dimensional regressors
- Introduce and motivate the polynomial kernel
Reading
I will be taking readings from Schölkopf and Smola (2002) and Rasmussen and Williams (2006), both of which are available digitally from the UC Berkeley library website. I’ll also use Hastie, Tibshirani, and Friedman (2009) Chapter 12.
Monomials: a high–dimensional regression problem
We’ll begin our discussion of kernels with a motivating regression problem. The only purpose of this problem is to motivate regression on higher–order polynomials.
Suppose we have a regressor \(\boldsymbol{r}\in [0,1)^R\), which we want to use to predict a real–valued response \(y\in \mathbb{R}^{}\) using ridge regression. (For today we’ll keep the ridge parameter \(\lambda\) fixed rather than estimating it via cross–validation.)
As usual, we want to admit an expressive model by transforming \(\boldsymbol{r}\). Since each component of \(\boldsymbol{r}\) lies in \([0,1)\), we can form an expressive set of features by dividing the unit interval into \(M\) parts, for \(m=1,...,M\), and forming the indicator features
\[ z_{rm} = \mathrm{I}\left(\boldsymbol{r}_{r} \in \Delta_m\right) \textrm{ for }r=1,\ldots,R \textrm{ and }m=1,\ldots,M, \textrm{ where }\Delta_m = [(m-1) / M, m / M). \] There are \(D = RM\) such indicators; stacking them into a single vector gives \(\boldsymbol{z}\in \mathbb{R}^{D}\).
Of course, the \(z_{rm}\) only check whether a single component (given by \(r\)) is in a region (given by \(m\)). To check whether a pair of components lie in a pair of regions, you can consider the pairwise products \(z_{rm} z_{r'm'}\). Similarly, triples of components are captured by third–order monomials, and so on. You can represent the precise bin of the point \(\boldsymbol{r}\) with the product of all \(R\) indicators \(\prod_{r=1}^R z_{rm_{r}}\), where \(m_r\) picks out a bin for regressor \(r\).
Thinking this way, suppose we decide to regress on all possible \(K\)–th order monomials, that is, on the non–linear feature maps \(z_{r_1 m_1} z_{r_2 m_2} ... z_{r_K m_K}\), for every possible \(r_1, \ldots, r_K\) and \(m_1, \ldots, m_K\). There are \(P = D^K\) such terms (note that there will be repeats). That’s a lot, and it’s not hard to imagine that \(P \gg N\), so that we have many more regressors than data points.
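To make the combinatorics concrete, here is a small illustrative sketch in Python (not part of the notes; the sizes \(R\), \(M\), and \(K\) are made up) that builds the indicator vector \(\boldsymbol{z}\) for a single point and counts the \(K\)–th order monomials.

```python
import numpy as np

# Illustrative only: build the indicator features z for a single point
# r in [0, 1)^R, and count how many K-th order monomials there are.
R, M, K = 5, 10, 3                    # number of regressors, bins, and monomial order
rng = np.random.default_rng(0)
r = rng.uniform(size=R)               # one regressor vector in [0, 1)^R

bins = np.floor(r * M).astype(int)    # which of the M bins each component falls into
z = np.zeros((R, M))
z[np.arange(R), bins] = 1.0           # z_{rm} = I(r_r in Delta_m)
z = z.ravel()                         # stacked into a D = R * M vector

D = R * M
print("D =", D, "indicator features;", "P =", D ** K, "K-th order monomials")
```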
Let’s examine using this transformation in both regression and classification.
Ridge regression
\[ \newcommand{\Um}{\underset{NP}{\boldsymbol{U}}} \newcommand{\Vm}{\underset{PN}{\boldsymbol{V}}} \]
Stepping back to the general case, suppose we have a design matrix, \(\boldsymbol{X}\), which is \(N \times P\), with \(P > N\). Recall that the ridge estimator is given by
\[ \hat{\boldsymbol{\beta}}_{L2} := (\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_{P})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}. \]
Even though \(\boldsymbol{X}\) has rank at most \(N < P\), the inverse is defined thanks to the ridge correction. However, the matrix \(\boldsymbol{X}^\intercal\boldsymbol{X}+ \lambda \boldsymbol{I}_{P}\) is \(P \times P\), so forming and inverting it can be quite expensive because of the size of the feature space.
Let’s learn a trick to avoid this heavy computation. This trick will in fact allow us to do regression tractably in infinite–dimensional feature spaces, and will lead to a different interpretation of linear regression that can help explain how interpolating estimators might still generalize.
I’ll use a linear algebra trick that Wikipedia calls the “push–through identity” (Wikipedia contributors (2024)). Suppose \(\Um\) is \(N\times P\) and \(\Vm\) is \(P \times N\), and both \(\Um \Vm + \boldsymbol{I}_{N}\) and \(\Vm\Um + \boldsymbol{I}_{P}\) are invertible. Then
\[ (\Um \Vm + \boldsymbol{I}_{N}) \Um = \Um (\Vm \Um + \boldsymbol{I}_{P}) \quad\Leftrightarrow\quad \Um (\Vm \Um + \boldsymbol{I}_{P})^{-1} = (\Um \Vm + \boldsymbol{I}_{N})^{-1} \Um. \]
The proof is simple: both sides of the left–hand equation expand to \(\Um \Vm \Um + \Um\), and multiplying on the left by \((\Um \Vm + \boldsymbol{I}_{N})^{-1}\) and on the right by \((\Vm \Um + \boldsymbol{I}_{P})^{-1}\) gives the right–hand equation. The identity lets you “push” a non–square matrix inside an inverse and pop it out on the other side, changing the dimension of the inverse. (This identity can be used to derive the Woodbury formula, which is why it appears in Wikipedia contributors (2024).)
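As a quick numerical sanity check (my own addition, using random matrices), the identity can be verified directly:

```python
import numpy as np

# Numerically verify the push-through identity
#   U (V U + I_P)^{-1} = (U V + I_N)^{-1} U
# for random U (N x P) and V (P x N).
rng = np.random.default_rng(0)
N, P = 4, 7
U = rng.normal(size=(N, P))
V = rng.normal(size=(P, N))

lhs = U @ np.linalg.inv(V @ U + np.eye(P))
rhs = np.linalg.inv(U @ V + np.eye(N)) @ U
print(np.allclose(lhs, rhs))   # True, up to floating point error
```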
Applying the push–through identity to the ridge formula (see homework),
\[ \hat{\boldsymbol{\beta}}_{L2} := \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{X}^\intercal+ \lambda \boldsymbol{I}_{N})^{-1} \boldsymbol{Y}. \]
Note that we have replaced a \(P \times P\) inverse with an \(N \times N\) inverse, and we have replaced \(\boldsymbol{X}^\intercal\boldsymbol{X}\) with \(\boldsymbol{X}\boldsymbol{X}^\intercal\). Interestingly, unlike \(\boldsymbol{X}^\intercal\boldsymbol{X}\), the entries of \(\boldsymbol{X}\boldsymbol{X}^\intercal\) are simply pairwise dot products of feature vectors:
\[ (\boldsymbol{X}\boldsymbol{X}^\intercal)_{n n'} = \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'}. \]
We need to compute \(N (N + 1) / 2\) of these, and each inner product is \(P\)–dimensional. Since the inverse is now only \(N \times N\), we have already gained in computational efficiency. But there is another gain, with much wider implications: an efficient way to compute the \(P\)–dimensional inner product itself.
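Here is a hedged sketch, with simulated \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) and a fixed \(\lambda\), confirming that the original ridge formula (a \(P \times P\) inverse) and the push–through form (an \(N \times N\) inverse) agree when \(P > N\):

```python
import numpy as np

# Compare the original ridge estimator (a P x P inverse) with the
# push-through form (an N x N inverse) when P > N.
rng = np.random.default_rng(1)
N, P, lam = 20, 200, 0.5
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

beta_original = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ Y)
beta_pushthrough = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), Y)
print(np.allclose(beta_original, beta_pushthrough))   # True
```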
The polynomial kernel
Let’s go back to assuming that each entry of \(\boldsymbol{x}\) is a \(K\)–th order monomial of some feature vector \(\boldsymbol{z}\), so that for a vector \(\boldsymbol{s}\in \{1, \ldots, D\}^{K}\) of indices
\[ \boldsymbol{x}_{\boldsymbol{s}} = \prod_{k=1}^K \boldsymbol{z}_{\boldsymbol{s}_k} = \boldsymbol{z}_{s_1} \boldsymbol{z}_{s_2} \ldots \boldsymbol{z}_{s_K}. \]
Let \(S\) be the \(D^K\)–sized set of all possible orderings, i.e., all possible values of \(\boldsymbol{s}\). Then we can write
\[ \begin{aligned} \boldsymbol{x}_{n}^\intercal\boldsymbol{x}_{m} ={}& \sum_{\boldsymbol{s}\in S} \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ns_2} \ldots \boldsymbol{z}_{ns_K} \times \boldsymbol{z}_{ms_1} \boldsymbol{z}_{ms_2} \ldots \boldsymbol{z}_{ms_K} \\={}& \sum_{s_1 =1}^D \sum_{s_2 = 1}^D \ldots \sum_{s_K=1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ns_2} \ldots \boldsymbol{z}_{ns_K} \times \boldsymbol{z}_{ms_1} \boldsymbol{z}_{ms_2} \ldots \boldsymbol{z}_{ms_K} \\={}& \sum_{s_1 =1}^D \sum_{s_2 = 1}^D \ldots \sum_{s_K=1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ms_1} \times \boldsymbol{z}_{ns_2} \boldsymbol{z}_{ms_2} \times \ldots \times \boldsymbol{z}_{ns_K} \boldsymbol{z}_{ms_K} \\={}& \left(\sum_{s_1 =1}^D \boldsymbol{z}_{ns_1} \boldsymbol{z}_{ms_1} \right) \left(\sum_{s_2 =1}^D \boldsymbol{z}_{ns_2} \boldsymbol{z}_{ms_2} \right) \ldots \left(\sum_{s_K =1}^D \boldsymbol{z}_{ns_K} \boldsymbol{z}_{ms_K} \right) \\={}& \left(\boldsymbol{z}_n^\intercal\boldsymbol{z}_m \right)^K. \end{aligned} \]
This simple re-write turned a sum of \(P = D^K\) terms into a single sum of \(D\) terms, raised to the power \(K\)! Let us define the “kernel function” \[ \mathcal{K}_{\,}(\boldsymbol{z}, \boldsymbol{z}') = \left(\boldsymbol{z}^\intercal\boldsymbol{z}' \right)^K. \]
We can then write
\[ (\boldsymbol{X}\boldsymbol{X}^\intercal)_{nm} = \mathcal{K}_{\,}(\boldsymbol{z}_n, \boldsymbol{z}_m), \]
so that evaluating \(\boldsymbol{X}\boldsymbol{X}^\intercal\) costs only order \(D N (N + 1) / 2\) rather than order \(P N (N + 1) / 2 = D^K N (N + 1) / 2\). This is a massive savings when \(D\) and \(K\) are large.
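The algebra above can also be checked numerically; the following sketch (with small, made-up \(D\) and \(K\)) compares the explicit monomial inner product against the kernel \(\left(\boldsymbol{z}_n^\intercal\boldsymbol{z}_m\right)^K\):

```python
import numpy as np
from itertools import product

# Check that the inner product of explicit K-th order monomial features
# equals the polynomial kernel (z_n^T z_m)^K.
rng = np.random.default_rng(2)
D, K = 4, 3
z_n, z_m = rng.normal(size=D), rng.normal(size=D)

def monomials(z):
    # One entry per index vector s in {1, ..., D}^K: prod_k z_{s_k}.
    return np.array([np.prod(z[list(s)]) for s in product(range(D), repeat=K)])

explicit = monomials(z_n) @ monomials(z_m)   # a sum of D^K = 64 terms
kernel = (z_n @ z_m) ** K                    # a single D-term sum, raised to K
print(np.allclose(explicit, kernel))         # True
```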
Note that we never even need to form the \(P\)–dimensional features or compute \(\hat{\boldsymbol{\beta}}\) explicitly when forming a prediction at a new datapoint, since
\[ \boldsymbol{x}_{\mathrm{new}}^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{X}(\boldsymbol{X}\boldsymbol{X}^\intercal+ \lambda \boldsymbol{I}_{N})^{-1} \boldsymbol{Y}, \]
and the \(n\)–th entry of \(\boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{X}\) is given by \(\mathcal{K}_{\,}(\boldsymbol{z}_{\mathrm{new}}, \boldsymbol{z}_n)\).
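Putting the pieces together, here is a minimal kernel ridge regression sketch (simulated data, fixed \(\lambda\); my own illustration) that never forms the \(D^K\)–dimensional monomial features:

```python
import numpy as np

# Kernel ridge regression with the polynomial kernel (z^T z')^K, never
# forming the D^K-dimensional monomial features or beta explicitly.
rng = np.random.default_rng(3)
N, D, K, lam = 50, 5, 3, 0.1
Z = rng.normal(size=(N, D))      # raw features z_1, ..., z_N
Y = rng.normal(size=N)           # simulated responses
z_new = rng.normal(size=D)       # a new point to predict at

K_train = (Z @ Z.T) ** K                               # (X X^T)_{n n'} = K(z_n, z_{n'})
alpha = np.linalg.solve(K_train + lam * np.eye(N), Y)  # (X X^T + lam I_N)^{-1} Y
k_new = (Z @ z_new) ** K                               # n-th entry is K(z_new, z_n)
print(k_new @ alpha)                                   # the prediction x_new^T beta_hat
```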
Kernelizing classification with support vector machines
We cannot use the above trick directly with logistic regression, since we don’t have a simple closed form for the solution. In order to kernelize logistic regression we’ll need to take this approach to a higher level of mathematical abstraction. But support vector classifiers are expressible in terms of inner products, and so are amenable to the kernel trick. In fact, historically, SVCs were the entry point for kernel methods into machine learning.
Suppose we have a binary classification problem, and encode our responses as \(y_n \in \{-1, 1\}\) rather than our usual \(\{0, 1\}\), and classify with a linear function
\[ \hat{y}_n = \mathrm{I}\left(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0 \ge 0\right), \]
so that a point is correctly classified if and only if \(y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) > 0\). Suppose that the data admits a separating hyperplane; that is, there exists at least one (and typically many) \(\beta_0\) and \(\boldsymbol{\beta}\) such that all the points are correctly classified. The support vector classifier finds the separating hyperplane that maximizes the minimum distance from any point to the hyperplane, which can be formulated as finding
\[ \underset{\boldsymbol{\beta}, \beta_0}{\mathrm{argmin}}\, \frac{1}{2}\left\Vert\boldsymbol{\beta}\right\Vert_2^2 \quad\textrm{such that}\quad y_n(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) \ge 1 \textrm{ for all } n. \]
Using the Lagrange multipliers \(\boldsymbol{\alpha}\in [0,\infty)^N\) to enforce the constraint, we can equivalently write the SVC problem as
\[ \underset{\boldsymbol{\alpha}}{\mathrm{argmax}}\, \underset{\boldsymbol{\beta}, \beta_0}{\mathrm{argmin}}\, \left(\frac{1}{2}\left\Vert\boldsymbol{\beta}\right\Vert_2^2 - \sum_{n=1}^N\alpha_n (y_n(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) - 1)\right) \quad\textrm{such that}\quad\alpha_n \ge 0. \]
(See Hastie, Tibshirani, and Friedman (2009) 4.5.) Setting to zero the derivatives with respect to \(\boldsymbol{\beta}\) and \(\beta_0\), respectively, gives (for a given \(\boldsymbol{\alpha}\)),
\[ \begin{aligned} \hat{\boldsymbol{\beta}}- \sum_{n=1}^N\alpha_n y_n \boldsymbol{x}_n ={}& 0 & \sum_{n=1}^N\alpha_n y_n ={}& 0. \end{aligned} \]
These identities must hold at the inner optimum for any \(\boldsymbol{\alpha}\), so we can plug them in to get the dual objective:
\[ \begin{aligned} \mathscr{L}(\boldsymbol{\alpha}) :={}& \frac{1}{2}\left\Vert\sum_{n=1}^N\alpha_n y_n \boldsymbol{x}_n\right\Vert_2^2 - \sum_{n=1}^N\alpha_n \left(y_n \left(\left(\sum_{n'=1}^N\alpha_{n'} y_{n'} \boldsymbol{x}_{n'} \right)^\intercal\boldsymbol{x}_n + \beta_0 \right) - 1 \right) \\={}& \frac{1}{2}\sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'} - \sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_{n} ^\intercal\boldsymbol{x}_{n'} - \beta_0 \sum_{n=1}^N\alpha_{n} y_{n} + \sum_{n=1}^N\alpha_n \\={}& \sum_{n=1}^N\alpha_n -\frac{1}{2}\sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'}. \end{aligned} \]
We thus find our hyperplane by solving
\[ \begin{aligned} \hat{\boldsymbol{a}}:={}& \underset{\boldsymbol{\alpha}}{\mathrm{argmax}}\, \left( \sum_{n=1}^N\alpha_n -\frac{1}{2}\sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'} \right) \textrm{ subject to} \\ & \alpha_n \ge{} 0 \textrm{ and } \sum_{n=1}^N\alpha_n y_n ={} 0. \end{aligned} \]
This is a linearly constrained quadratic optimization problem that can be solved with standard quadratic programming methods. Once we have a solution \(\hat{\boldsymbol{a}}\), we can plug it into the first–order conditions to compute \(\hat{\boldsymbol{\beta}}\), and then use the complementary slackness condition \(\alpha_n(y_n(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n + \hat{\beta}_0) - 1) = 0\) to find \(\hat{\beta}_0\). The \(\alpha_n\) that are non–zero identify the “support vectors” that lie on the margin; it is only these vectors that determine the classifier.
Importantly, by eliminating \(\hat{\boldsymbol{\beta}}\), the dual of the SVC algorithm is formulated only in terms of the inner products \(\boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'}\). We can thus replace this inner product with our kernel function. This is such a big step that it gets a new name: a “support vector machine” (SVM). For a particular kernel \(\mathcal{K}_{\,}(\cdot, \cdot)\) which evaluates our inner product,
\[ \begin{aligned} \hat{\boldsymbol{a}}:={}& \underset{\boldsymbol{\alpha}}{\mathrm{argmax}}\, \left( \sum_{n=1}^N\alpha_n -\frac{1}{2}\sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \mathcal{K}_{\,}(\boldsymbol{x}_n, \boldsymbol{x}_{n'}) \right) \textrm{ subject to} \\ & \alpha_n \ge{} 0 \textrm{ and } \sum_{n=1}^N\alpha_n y_n ={} 0. \end{aligned} \]
Note that we do not even need to evaluate \(\hat{\boldsymbol{\beta}}\) in this setting, since we only really use \(\hat{\boldsymbol{\beta}}\) to predict a new datapoint, \(\boldsymbol{x}_{\mathrm{new}}\), and the first–order conditions require that
\[ \boldsymbol{x}_{\mathrm{new}}^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{x}_{\mathrm{new}}^\intercal\sum_{n=1}^N\hat{\alpha}_n y_n \boldsymbol{x}_n = \sum_{n=1}^N\hat{\alpha}_n y_n \boldsymbol{x}_{\mathrm{new}}^\intercal\boldsymbol{x}_n = \sum_{n=1}^N\hat{\alpha}_n y_n \mathcal{K}_{\,}(\boldsymbol{x}_{\mathrm{new}}, \boldsymbol{x}_n). \]
So we have expressed our problem entirely in terms of the kernel, eliminating our inner products altogether.
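As an illustration (not the implementation used in any of the readings), the following sketch solves the kernelized dual with scipy’s general-purpose SLSQP optimizer on a tiny, made-up dataset that is separable by a degree–2 polynomial surface, and then classifies a new point using only kernel evaluations. A dedicated quadratic programming solver would be the usual choice in practice; SLSQP is used here only to keep the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize

# Solve the (hard-margin) kernelized SVM dual on a toy dataset, then
# predict with f(x_new) = sum_n alpha_n y_n K(x_new, x_n) + beta_0.
rng = np.random.default_rng(4)
N, degree = 30, 2
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)   # separable by a degree-2 surface

Kmat = (X @ X.T) ** degree                        # polynomial kernel matrix
Q = (y[:, None] * y[None, :]) * Kmat              # Q_{n n'} = y_n y_{n'} K(x_n, x_{n'})

def neg_dual(alpha):
    # Negative of the dual objective (we minimize instead of maximize).
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

result = minimize(
    neg_dual,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                              # alpha_n >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_n alpha_n y_n = 0
)
alpha_hat = result.x

# Recover beta_0 from a support vector: alpha_n > 0 implies y_n f(x_n) = 1.
sv = np.argmax(alpha_hat)
beta_0 = y[sv] - (alpha_hat * y) @ Kmat[:, sv]

x_new = rng.normal(size=2)
f_new = (alpha_hat * y) @ ((X @ x_new) ** degree) + beta_0
print("predicted class:", np.sign(f_new))
```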
Slack variables
Following Hastie, Tibshirani, and Friedman (2009) 12.2, we can allow some violations of the margin by slightly modifying the SVC formulation. Suppose we define \(N\) “slack” variables \(\varepsilon_n = 1 - y_n(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0)\). Note that enforcing \(\varepsilon_n \le 0\) for all \(n\) is the same as enforcing \(y_n(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) \ge 1\). If we allow some of the \(\varepsilon_n\) to be positive, then we permit some margin violations, which we try to keep as small as possible. This can be written as
\[ \mathscr{L}(\boldsymbol{\beta}) = \frac{1}{2} \left\Vert\boldsymbol{\beta}\right\Vert_2^2 + C \sum_{n=1}^N(\varepsilon_n)^+ = \frac{1}{2} \left\Vert\boldsymbol{\beta}\right\Vert_2^2 + C \sum_{n=1}^N(1 - y_n(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0))^+, \]
where
\[ (\varepsilon)^+ = \begin{cases} \varepsilon & \textrm{ when }\varepsilon \ge 0 \\ 0 & \textrm{ when }\varepsilon < 0 \\ \end{cases} \]
is the “hinge loss” or ReLU (rectified linear unit) function. This can be written equivalently in an SVC–like formulation by minimizing a positive upper bound on \(\varepsilon_n\) (see Hastie, Tibshirani, and Friedman (2009) eq 12.8 in section 12.2.1), which leads to the dual
\[ \begin{aligned} \mathscr{L}(\boldsymbol{\alpha}) ={}& \sum_{n=1}^N\alpha_n - \frac{1}{2} \sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_{n}^\intercal\boldsymbol{x}_{n'} \\ &\textrm{ such that }0\le \alpha_n \le C \textrm{ and }\sum_{n=1}^N\alpha_n y_n = 0. \end{aligned} \]
From this it follows again that, for any \(\boldsymbol{\alpha}\), \(\hat{\boldsymbol{\beta}}= \sum_{n=1}^N\alpha_n y_n \boldsymbol{x}_n\), and so \[ \left\Vert\hat{\boldsymbol{\beta}}\right\Vert_2^2 = \sum_{n=1}^N\sum_{n'=1}^N\alpha_n \alpha_{n'} y_n y_{n'} \boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'} \quad\textrm{and}\quad 1 - y_n(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}_n + \beta_0) = 1 - y_n \left(\sum_{n'=1}^N\alpha_{n'} y_{n'} \boldsymbol{x}_{n'}^\intercal\boldsymbol{x}_n + \beta_0 \right). \]
Everything here is expressed entirely in terms of the inner product \(\boldsymbol{x}_n^\intercal\boldsymbol{x}_{n'}\), which can be kernelized. Writing things this way motivates thinking of \(f(\boldsymbol{x})\) as living in the class of functions \(\mathcal{F}:= \{ f: f(\boldsymbol{x}) = \boldsymbol{\beta}^\intercal\boldsymbol{x}+ \beta_0\}\), whose “size” is measured by \(\left\Vert\boldsymbol{\beta}\right\Vert_2^2\) via the identity above.
Given this, one might imagine swapping the ReLU function for other loss functions. In particular, the function \(\phi(f) = \log(1 + \exp(-y f))\) gives rise to the familiar logistic loss under the transformation from \(y\in \{-1, 1\}\) to \(b \in \{0,1\}\) given by \(b = \frac{1}{2} y+ \frac{1}{2}\), since
\[ \begin{aligned} \phi(f) ={}& b \log(1 + \exp(-f)) + (1 - b) \log(1 + \exp(f)) \\={}& b \log\left(\frac{1 + \exp(-f)}{1 + \exp(f)}\right) + \log(1 + \exp(f)) \\={}& b \log\left(\exp(-f) \frac{1 + \exp(f)}{1 + \exp(f)}\right) + \log(1 + \exp(f)) \\={}& - b f+ \log(1 + \exp(f)). \end{aligned} \]
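A quick numerical check of this identity (again, purely illustrative):

```python
import numpy as np

# Check that log(1 + exp(-y f)) = -b f + log(1 + exp(f)) with b = (y + 1) / 2.
f = np.linspace(-3.0, 3.0, 7)
for y in (-1.0, 1.0):
    b = 0.5 * y + 0.5
    lhs = np.log(1.0 + np.exp(-y * f))
    rhs = -b * f + np.log(1.0 + np.exp(f))
    print(np.allclose(lhs, rhs))   # True for both y = -1 and y = +1
```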
At this point, this only tentatively suggests the possibility of generalizing the SVM kernel trick to logistic regression by treating \(\left\Vert\boldsymbol{\beta}\right\Vert_2^2\) as a “size” of \(f\) induced by the kernel. We will make this connection precise in the coming lectures.
The kernel trick
In these examples, we went from a particular featurization to a kernel function. However, upon inspection, there are some funny things about this feature space. You might ask whether you can always take a particular featurization and turn it into a computationally efficient kernel. The answer appears to be no, in general.
But that may be the wrong question. One might instead ask: why start with a featurization at all? Why not start with a computationally efficient kernel, and simply replace the dot product with that kernel? What kinds of kernel functions can you use? What is the implicit featurization for a kernel? And are there algorithms other than ridge regression that admit this trick? These are the questions we’ll answer in the coming unit.