Homework 5 (due May 3rd 9pm)
Stat 154/254: Statistical Machine Learning
1 Inner products
Recall that an inner product on a vector space must satisfy symmetry, linearity, and positive definiteness. Show that each of the following inner products satisfies these properties.
(In each problem, you may assume without proof that the space is a valid vector space — you need only verify the properties of the inner product.)
(a) Dot product
Let \(\boldsymbol{x}, \boldsymbol{s}\in \mathbb{R}^{D}\), and define \(\left<\boldsymbol{x}, \boldsymbol{s}\right> := \boldsymbol{x}^\intercal\boldsymbol{s}\).
(b) Positive definite matrices
Let \(\boldsymbol{x}, \boldsymbol{s}\in \mathbb{R}^{D}\), let \(\boldsymbol{A}\) denote a symmetric positive definite matrix, and define \(\left<\boldsymbol{x}, \boldsymbol{s}\right>_\boldsymbol{A}:= \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{s}\).
(c) Infinite series
Let \(\mathcal{X}= \{ \boldsymbol{x}= (x_1, x_2, \ldots) \textrm{ such that } \sum_{n=1}^\infty x_n^2 < \infty \}\), and let \(\left<\boldsymbol{x}, \boldsymbol{s}\right> = \sum_{n=1}^\infty x_n s_n\).
(d) Continuous functions
Let \(\mu(\boldsymbol{x})\) be nonzero on some bounded set \(\Omega\) so that \(\int_A \mu(\boldsymbol{x}) d \boldsymbol{x}> 0\) for every open set \(A \subseteq \Omega\). Let \(\mathcal{F}= \{ f: \mathcal{X}\mapsto \mathbb{R}^{} \textrm{ such that } f\textrm{ is continuous and } \int f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}< \infty \}\), and let \(\left<f, g\right>_\mu := \int f(\boldsymbol{x}) g(\boldsymbol{x}) \mu(\boldsymbol{x}) d \boldsymbol{x}\).
(e) Covariances
Let \(\mathcal{F}= \{ f: \mathcal{X}\mapsto \mathbb{R}^{} \textrm{ such that } f\textrm{ is continuous and } \mathbb{E}_{\,}\left[f(\boldsymbol{x})^2\right] < \infty \}\) where \(\boldsymbol{x}\) is a random variable, and \(\left<f, g\right>_\mathcal{X}= \mathbb{E}_{\boldsymbol{x}}\left[f(\boldsymbol{x}) g(\boldsymbol{x})\right]\). You may assume that every set in the domain of \(\boldsymbol{x}\) has positive probability.
(f) RKHS inner products
Let \(\mathcal{K}_{\,}(\cdot, \cdot)\) denote a symmetric, positive definite kernel, let \(\mathcal{H}= \{ f: \mathcal{X}\mapsto\mathbb{R}^{} \textrm{ such that } f(\boldsymbol{x}) = \sum_{k=1}^K a_k \mathcal{K}_{\,}(\boldsymbol{x}, \boldsymbol{x}_k) \textrm{ for some }\boldsymbol{a},\boldsymbol{x}_1\ldots,\boldsymbol{x}_K\}\), and let \(\left<\cdot,\cdot\right>_\mathcal{H}\) denote the usual RKHS inner product.
For each solution, we prove (in order) symmetry, linearity, and positive–definiteness.
(a) Dot product
- \(\left<\boldsymbol{x}, \boldsymbol{s}\right> = \boldsymbol{x}^\intercal\boldsymbol{s}= \boldsymbol{s}^\intercal\boldsymbol{x}= \left<\boldsymbol{s}, \boldsymbol{x}\right>\)
- \(\left<a \boldsymbol{x}+ \boldsymbol{y}, \boldsymbol{s}\right> = \left( a \boldsymbol{x}+ \boldsymbol{y}\right)^\intercal\boldsymbol{s}= a \boldsymbol{x}^\intercal\boldsymbol{s}+ \boldsymbol{y}^\intercal\boldsymbol{s}= a \left<\boldsymbol{x}, \boldsymbol{s}\right> + \left<\boldsymbol{y}, \boldsymbol{s}\right>\)
- \(\left<\boldsymbol{x}, \boldsymbol{x}\right> = \boldsymbol{x}^\intercal\boldsymbol{x}= \sum_{p=1}^D x_p^2,\) which is zero if and only if \(x_p = 0\) for each \(p\).
(b) Positive definite matrices
- \(\left<\boldsymbol{x}, \boldsymbol{s}\right>_\boldsymbol{A}= \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{s}= (\boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{s})^\intercal= \boldsymbol{s}^\intercal\boldsymbol{A}^\intercal\boldsymbol{x}= \boldsymbol{s}^\intercal\boldsymbol{A}\boldsymbol{x}= \left<\boldsymbol{s}, \boldsymbol{x}\right>_\boldsymbol{A}\)
- Linearity follows exactly as in (a), since \(\boldsymbol{x}\mapsto \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{s}\) is linear in \(\boldsymbol{x}\).
- Let \(\boldsymbol{e}_p,\lambda_p\) denote the eigen decomposition of \(\boldsymbol{A}\). \(\lambda_p > 0\) by positive definiteness. Then \(\boldsymbol{x}= \sum_{p} (\boldsymbol{x}^\intercal\boldsymbol{e}_p) \boldsymbol{e}_p\), and
\[ \begin{aligned} \left<\boldsymbol{x}, \boldsymbol{x}\right>_\boldsymbol{A} ={}& \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{x} \\={}& \sum_p \sum_{p'} \boldsymbol{e}_{p}^\intercal\boldsymbol{A}\boldsymbol{e}_{p'} (\boldsymbol{x}^\intercal\boldsymbol{e}_p) (\boldsymbol{x}^\intercal\boldsymbol{e}_{p'}) \\={}& \sum_p \sum_{p'} \lambda_{p'} \boldsymbol{e}_{p}^\intercal\boldsymbol{e}_{p'} (\boldsymbol{x}^\intercal\boldsymbol{e}_p) (\boldsymbol{x}^\intercal\boldsymbol{e}_{p'}) \\={}& \sum_p \lambda_{p} (\boldsymbol{x}^\intercal\boldsymbol{e}_p)^2, \end{aligned} \]
which is zero if and only if \(\boldsymbol{x}^\intercal\boldsymbol{e}_p = 0\) for each \(p\), and the latter can happen only if \(\boldsymbol{x}= \boldsymbol{0}\) because \(\boldsymbol{e}_p\) are a basis.
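As a quick sanity check of (a) and (b), here is a small numerical sketch in Python; the matrix \(\boldsymbol{A}\) and the test vectors are randomly generated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
B = rng.normal(size=(D, D))
A = B @ B.T + D * np.eye(D)      # symmetric positive definite by construction

def inner_A(x, s):
    return x @ A @ s

x, s, y = rng.normal(size=(3, D))
a = 2.7

assert np.isclose(inner_A(x, s), inner_A(s, x))                               # symmetry
assert np.isclose(inner_A(a * x + y, s), a * inner_A(x, s) + inner_A(y, s))   # linearity
assert inner_A(x, x) > 0 and inner_A(np.zeros(D), np.zeros(D)) == 0           # positive definiteness
```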
(c) Infinite series
- Symmetry follows from \(x_n s_n = s_n x_n\).
- First, we show that the inner product series is absolutely convergent: by Cauchy–Schwarz, \(\sum_{n=1}^N \left|x_n s_n\right| \le \sqrt{\sum_{n=1}^N|x_n|^2 \sum_{n=1}^N|s_n|^2}\), and by assumption the right hand side converges as \(N \rightarrow \infty\). For absolutely convergent series, \(\sum_{n=1}^\infty(x_n + y_n)s_n = \sum_{n=1}^\infty(x_n s_n + y_n s_n) = \sum_{n=1}^\infty x_n s_n + \sum_{n=1}^\infty y_n s_n\), and for any convergent series, \(a \sum_{n=1}^\infty x_n s_n = \sum_{n=1}^\infty a x_n s_n\). The result follows as in (a).
- \(\left<\boldsymbol{x}, \boldsymbol{x}\right> = \sum_{n=1}^\infty x_n^2\), which is zero if and only if \(x_n = 0\) for each \(n\).
(d) Continuous functions
- Symmetry follows as in (a) and from \(f(\boldsymbol{x}) g(\boldsymbol{x}) = g(\boldsymbol{x}) f(\boldsymbol{x})\).
- First, \(\left<f, g\right>_\mu\) is finite by Cauchy–Schwarz and the assumption that \(\int f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}\) and \(\int g(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}\) are finite. Then linearity follows from standard properties of the integral on bounded sets: \(\int a f(\boldsymbol{x}) \mu(\boldsymbol{x}) d\boldsymbol{x}= a \int f(\boldsymbol{x}) \mu(\boldsymbol{x}) d\boldsymbol{x}\) and \(\int (f(\boldsymbol{x}) + g(\boldsymbol{x})) \mu(\boldsymbol{x}) d\boldsymbol{x}= \int f(\boldsymbol{x}) \mu(\boldsymbol{x}) d\boldsymbol{x}+ \int g(\boldsymbol{x}) \mu(\boldsymbol{x}) d\boldsymbol{x}\).
- Suppose that \(f(\boldsymbol{x}_0) = f_0 \ne 0\) at some point \(\boldsymbol{x}_0\). Then, by continuity, there exists an open set \(A\) around \(\boldsymbol{x}_0\) within which \(|f(\boldsymbol{x})| \ge |f_0| / 2\), and so \(\left<f, f\right>_\mu = \int f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}\ge \int_A f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}\ge \frac{f_0^2}{4} \int_A \mu(\boldsymbol{x}) d\boldsymbol{x}> 0\). Conversely, if \(f(\boldsymbol{x}) = 0\) for all \(\boldsymbol{x}\), then \(\left<f, f\right>_\mu = \int f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}= 0\).
(e) Covariances
This is equivalent to (d) if \(\mu(\cdot)\) is taken to be the probability density of \(\boldsymbol{x}\), if necessary after transformation to a bounded set (e.g. \(\boldsymbol{z}= \mathrm{arctan}(\boldsymbol{x})\) componentwise).
(f) RKHS inner products
- The kernel is symmetric, so \(\left<f, g\right>_\mathcal{H}= \sum_{k=1}^K \sum_{j=1}^J a_k b_j \mathcal{K}_{\,}(\boldsymbol{x}_k, \boldsymbol{x}_j) = \sum_{j=1}^J \sum_{k=1}^K b_j a_k \mathcal{K}_{\,}(\boldsymbol{x}_j, \boldsymbol{x}_k) = \left<g, f\right>_\mathcal{H}\).
- For linearity,
\[ \begin{aligned} \left<c f, g\right>_\mathcal{H}={}& \left<c \sum_{k=1}^K a_k \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \sum_{j=1}^J b_j \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_j)\right>_\mathcal{H} \\={}& \sum_{k=1}^K \sum_{j=1}^J c a_k b_j \left<\mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_j)\right>_\mathcal{H} \\={}& \sum_{k=1}^K \sum_{j=1}^J c a_k b_j \mathcal{K}_{\,}(\boldsymbol{x}_k, \boldsymbol{x}_j) \\={}& c \sum_{k=1}^K \sum_{j=1}^J a_k b_j \mathcal{K}_{\,}(\boldsymbol{x}_k, \boldsymbol{x}_j) \\={}& c \left<f, g\right>_\mathcal{H}. \end{aligned} \]
Also, by definition,
\[ \begin{aligned} \left<f, g+ h\right>_\mathcal{H} ={}& \left<\sum_{k=1}^K a_k \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \sum_{j=1}^J b_j \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_j) + \sum_{m=1}^M c_m \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_m)\right>_\mathcal{H} \\={}& \sum_{k=1}^K \sum_{j=1}^J b_j a_k \left<\mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_j)\right>_\mathcal{H}+ \sum_{k=1}^K \sum_{m=1}^M a_k c_m \left<\mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_m)\right>_\mathcal{H} \\={}& \left<\sum_{k=1}^K a_k \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \sum_{j=1}^J b_j \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_j)\right>_\mathcal{H}+ \left<\sum_{k=1}^K a_k \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), \sum_{m=1}^M c_m \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_m)\right>_\mathcal{H} \\={}& \left<f, g\right>_\mathcal{H}+ \left<f, h\right>_\mathcal{H}. \end{aligned} \]
- If \(f(\boldsymbol{x}) = 0\) for all \(\boldsymbol{x}\), then by the reproducing property
\[ \begin{aligned} \left<f, f\right>_\mathcal{H}={}& \sum_{k=1}^K a_k \left<\mathcal{K}_{\,}(\cdot, \boldsymbol{x}_k), f\right> = \sum_{k=1}^K a_k f(\boldsymbol{x}_k) = 0. \end{aligned} \]
Conversely, by Cauchy–Schwarz (which holds for general inner product spaces), if \(\left<f, f\right>_\mathcal{H}= 0\) then
\[ \left|f(x)\right|^2 = \left|\left<\mathcal{K}_{\,}(\cdot, x), f\right>_\mathcal{H}\right|^2 \le \mathcal{K}_{\,}(x, x) \left<f, f\right>_\mathcal{H}, \]
and so \(f(x) = 0\) for all \(x\).
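As an illustrative aside, the reproducing-property bound above is easy to check numerically. The sketch below uses a Gaussian (RBF) kernel with arbitrary centers and coefficients; none of these choices come from the problem, they are just an example.

```python
import numpy as np

rng = np.random.default_rng(1)

def K(x, y, ell=1.0):
    # Gaussian (RBF) kernel, an example of a symmetric positive definite kernel
    return np.exp(-np.sum((x - y) ** 2) / (2 * ell**2))

centers = rng.normal(size=(4, 2))      # illustrative x_1, ..., x_K
a = rng.normal(size=4)                 # illustrative coefficients a_k
G = np.array([[K(xi, xj) for xj in centers] for xi in centers])
f_norm_sq = a @ G @ a                  # <f, f>_H = a^T K_X a

for _ in range(5):
    x = rng.normal(size=2)
    f_x = sum(a_k * K(x, x_k) for a_k, x_k in zip(a, centers))
    # Cauchy-Schwarz in the RKHS: |f(x)|^2 <= K(x, x) <f, f>_H
    assert f_x**2 <= K(x, x) * f_norm_sq + 1e-10
```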
2 Regression as an RKHS
Suppose we observe IID pairs \((\boldsymbol{z}, y)\), and have defined a feature map \(\phi:\mathcal{Z}\mapsto \mathbb{R}^{P}\) for regression. Let
\[ \mathcal{F}=\{f:\mathcal{Z}\mapsto \mathbb{R}^{} \textrm{ such that }f(\boldsymbol{z}) = \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta} \textrm{ for some } \boldsymbol{\beta}\in \mathbb{R}^{P}\} \]
denote the class of regression functions. We will express this class of functions as an RKHS using the kernel \(\mathcal{K}_{\,}(\boldsymbol{z}, \boldsymbol{z}') = \phi(\boldsymbol{z})^\intercal\phi(\boldsymbol{z}')\).
You may assume that the second moments of \(\phi(\boldsymbol{z})\) are finite, i.e. that \(\mathbb{E}_{\mathcal{Z}}\left[\phi(\boldsymbol{z}) \phi(\boldsymbol{z})^\intercal\right] < \infty\).
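For concreteness, here is a small numerical sketch of this kernel for a hypothetical feature map \(\phi(z) = (1, z, z^2)\) (an illustration only, not part of the problem); it checks that the gram matrix equals \(\Phi \Phi^\intercal\) and is therefore positive semi-definite.

```python
import numpy as np

def phi(z):
    return np.array([1.0, z, z**2])     # illustrative P = 3 feature map

def kernel(z, zp):
    return phi(z) @ phi(zp)             # K(z, z') = phi(z)^T phi(z')

z_points = np.linspace(-1.0, 1.0, 6)
Phi = np.stack([phi(z) for z in z_points])      # N x P feature matrix
G = np.array([[kernel(z, zp) for zp in z_points] for z in z_points])

assert np.allclose(G, Phi @ Phi.T)              # gram matrix = Phi Phi^T
assert np.min(np.linalg.eigvalsh(G)) >= -1e-10  # hence positive semi-definite
```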
(a)
Suppose that the final component of \(\phi(\boldsymbol{z})\), \(\phi_P(\boldsymbol{z})\), can be written as a linear combination of the others. In other words, suppose there exist \(a_1, \ldots, a_{P-1}\) such that \(\phi_P(\boldsymbol{z}) = a_1 \phi_1(\boldsymbol{z}) + \ldots + a_{P-1}\phi_{P-1}(\boldsymbol{z})\) for all \(\boldsymbol{z}\). Define a class \(\mathcal{F}'\) containing the same functions as \(\mathcal{F}\) but using only the first \(P-1\) components of \(\phi(\boldsymbol{z})\).
Given this result, we will assume from now on that we have written \(\mathcal{F}\) using the minimal dimension \(P\), and that we cannot write one component of \(\phi\) as a linear combination of the other components for all \(\boldsymbol{z}\).
(b)
Let
\[ \mathcal{H}= \{ f:\mathcal{Z}\mapsto \mathbb{R}^{} \textrm{ such that } f(\boldsymbol{z}) = \sum_{k=1}^K a_k \mathcal{K}_{\,}(\boldsymbol{z}, \boldsymbol{z}_k) \textrm{ for some }\boldsymbol{a}, \boldsymbol{z}_1,\ldots,\boldsymbol{z}_K \}, \]
and let \(\left<\cdot,\cdot\right>_\mathcal{H}\) denote the usual RKHS inner product. Show that \(f\in \mathcal{H}\) if and only if \(f\in \mathcal{F}\).
(c)
Suppose \(f(\cdot; \boldsymbol{\beta}) \in \mathcal{F}\) means \(f(\boldsymbol{z}; \boldsymbol{\beta}) = \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta}\). Show that \(\left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_\mathcal{H}= \boldsymbol{\beta}^\intercal\boldsymbol{\beta}'\).
(d)
Define the expectation inner product
\[ \left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}} = \mathbb{E}_{\mathcal{Z}}\left[f(\boldsymbol{z}; \boldsymbol{\beta}) f(\boldsymbol{z}; \boldsymbol{\beta}')\right]. \]
Under what conditions are the two inner products equal? That is, when does \(\left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}} = \left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_\mathcal{H}\) for all \(f\)?
(e)
Show that the solutions to the following two problems give the same regression function:
\[ \underset{\boldsymbol{\beta}\in \mathbb{R}^{P}}{\mathrm{argmin}}\, \left(\sum_{n=1}^N(y_n - \boldsymbol{\beta}^\intercal\phi(\boldsymbol{z}_n))^2 + \left\Vert\boldsymbol{\beta}\right\Vert_2^2 \right) \quad\textrm{and}\quad \underset{f\in \mathcal{H}}{\mathrm{argmin}}\, \left(\sum_{n=1}^N(y_n - f(\boldsymbol{z}_n))^2 + \left\Vert f\right\Vert_{\mathcal{H}}^2 \right). \]
You may assume for simplicity that the matrix \(\boldsymbol{X}\) formed from the observations \(\boldsymbol{x}_n = \phi(\boldsymbol{z}_n)\) is full–rank, and that \(P < N\).
(f)
In light of (d) and (e), comment on an RKHS interpretation of the practice of standardizing regressors before performing ridge regression.
(a)
If \(f\in \mathcal{F}\), then we can write (for any \(\boldsymbol{z}\)),
\[ \begin{aligned} f(\boldsymbol{z}) ={}& \sum_{p=1}^{P-1} b_p \phi_p(\boldsymbol{z}) + b_P \phi_P(\boldsymbol{z}) \\={}& \sum_{p=1}^{P-1} b_p \phi_p(\boldsymbol{z}) + b_P \sum_{p=1}^{P-1} a_p \phi_p(\boldsymbol{z}) \\={}& \sum_{p=1}^{P-1} (b_p + b_P a_p) \phi_p(\boldsymbol{z}), \end{aligned} \]
which is in \(\mathcal{F}'\).
(b)
Note that, for any \(\boldsymbol{z}\), \(\mathcal{K}_{\,}(\cdot, \boldsymbol{z}) = \sum_{p=1}^P \phi_p(\cdot) \phi_p(\boldsymbol{z})\). So if \(f\in \mathcal{H}\), then, for any \(\boldsymbol{z}\), \[ f(\boldsymbol{z}) = \sum_{k=1}^K a_k \mathcal{K}_{\,}(\boldsymbol{z}, \boldsymbol{z}_k) = \sum_{k=1}^K a_k \sum_{p=1}^P \phi_p(\boldsymbol{z}) \phi_p(\boldsymbol{z}_k) = \sum_{p=1}^P \left(\sum_{k=1}^K a_k \phi_p(\boldsymbol{z}_k) \right) \phi_p(\boldsymbol{z}), \] which is in \(\mathcal{F}\) with coefficients \(\beta_p = \left(\sum_{k=1}^K a_k \phi_p(\boldsymbol{z}_k) \right)\). This can be written as \(\boldsymbol{\beta}= \Phi_\boldsymbol{Z}\boldsymbol{a}\), where \(\boldsymbol{a}\) is a length–\(K\) vector with components \(a_k\), \(\boldsymbol{\beta}\) a \(P\)–vector with elements \(\beta_p\), and \(\Phi_\boldsymbol{Z}\) is a \(P \times K\) matrix with entries \(\phi_p(\boldsymbol{z}_k)\).
Using the same computation, if \(f\in \mathcal{F}\), then to show that \(f\in \mathcal{H}\), it suffices to find \(a_k\) and \(\boldsymbol{z}_k\) such that \(\sum_{k=1}^K a_k \phi_p(\boldsymbol{z}_k) = \beta_p\) for each \(p\). By (a), we may safely assume that we can find a set of \(P\) vectors \(\boldsymbol{z}_k\) such that the matrix \(\Phi_\boldsymbol{Z}\) is a \(P\times P\) full-rank matrix (otherwise we could reduce the space of features without changing the membership of \(\mathcal{H}\) and \(\mathcal{F}\)). Then finding the appropriate \(\boldsymbol{a}\) is equivalent to solving the linear system \(\Phi_\boldsymbol{Z}\boldsymbol{a}= \boldsymbol{\beta}\), which is possible because \(\Phi_\boldsymbol{Z}\) is invertible by construction.
(c)
By (b), we know that \(f(\cdot; \boldsymbol{\beta}) \in \mathcal{H}\) means \(f(\boldsymbol{z}; \boldsymbol{\beta}) = \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta}\), where \(\boldsymbol{\beta}= \Phi_\boldsymbol{Z}\boldsymbol{a}\) for some \(\boldsymbol{Z}\) and \(\boldsymbol{a}\) and invertible \(\Phi_\boldsymbol{Z}\). In general there will be many \(\boldsymbol{a}\) and \(\boldsymbol{Z}\) that give rise to the same \(\boldsymbol{\beta}\), but they all represent the same function, namely \(f(\cdot; \boldsymbol{\beta})\). Without loss of generality, we can represent \(f(\cdot; \boldsymbol{\beta})\) and \(f(\cdot; \boldsymbol{\beta}')\) using the same \(\boldsymbol{Z}\). Then
\[ \begin{aligned} \left<f(\cdot; \boldsymbol{\beta}), f(\cdot; \boldsymbol{\beta}')\right>_\mathcal{H}={}& \left<\sum_{p=1}^P a_p \mathcal{K}_{\,}(\cdot, \boldsymbol{z}_p), \sum_{p=1}^P a'_p \mathcal{K}_{\,}(\cdot, \boldsymbol{z}_p)\right>_\mathcal{H} \\={}& \boldsymbol{a}^\intercal\mathcal{K}_{\boldsymbol{Z}} \boldsymbol{a}' \\={}& \boldsymbol{a}^\intercal\Phi_\boldsymbol{Z}^\intercal\Phi_\boldsymbol{Z}\boldsymbol{a}' \\={}& (\Phi_\boldsymbol{Z}\boldsymbol{a})^\intercal(\Phi_\boldsymbol{Z}\boldsymbol{a}') \\={}& \boldsymbol{\beta}^\intercal\boldsymbol{\beta}'. \end{aligned} \]
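As a numerical sanity check of this identity, the sketch below uses the same hypothetical feature map \(\phi(z) = (1, z, z^2)\) as above, picks \(P\) points for which \(\Phi_\boldsymbol{Z}\) is invertible, and verifies that \(\boldsymbol{a}^\intercal\mathcal{K}_{\boldsymbol{Z}}\boldsymbol{a}' = \boldsymbol{\beta}^\intercal\boldsymbol{\beta}'\).

```python
import numpy as np

def phi(z):
    return np.array([1.0, z, z**2])   # illustrative feature map, P = 3

z_pts = np.array([-1.0, 0.0, 1.0])                  # P representer points
Phi_Z = np.stack([phi(z) for z in z_pts], axis=1)   # P x P matrix with columns phi(z_p)
K_Z = Phi_Z.T @ Phi_Z                               # gram matrix K(z_p, z_q)

beta = np.array([0.5, -1.0, 2.0])
beta_prime = np.array([1.0, 3.0, -0.5])
a = np.linalg.solve(Phi_Z, beta)                    # coefficients with Phi_Z a = beta
a_prime = np.linalg.solve(Phi_Z, beta_prime)

assert np.allclose(a @ K_Z @ a_prime, beta @ beta_prime)
```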
(d)
Directly computing,
\[ \left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}} = \mathbb{E}_{\mathcal{Z}}\left[\boldsymbol{\beta}^\intercal\phi(\boldsymbol{z}) \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta}'\right] = \boldsymbol{\beta}^\intercal\mathbb{E}_{\mathcal{Z}}\left[\phi(\boldsymbol{z}) \phi(\boldsymbol{z})^\intercal\right] \boldsymbol{\beta}'. \]
This is the same inner product as \(\left<\cdot, \cdot\right>_\mathcal{H}\) if and only if \(\mathbb{E}_{\mathcal{Z}}\left[\phi(\boldsymbol{z}) \phi(\boldsymbol{z})^\intercal\right] = \boldsymbol{I}_{P}\). If the feature vectors \(\phi(\boldsymbol{z})\) are mean zero, this says that the features are uncorrelated with unit variance.
(e)
By the representer theorem, the solution to the second problem takes the form \[ \hat{f}(\cdot) = \sum_{n=1}^N a_n \mathcal{K}_{\,}(\cdot, \boldsymbol{z}_n) = \sum_{n=1}^N a_n \sum_{p=1}^P \phi_p(\boldsymbol{z}_n) \phi_p(\cdot) = \sum_{p=1}^P \left(\sum_{n=1}^N a_n \phi_p(\boldsymbol{z}_n) \right) \phi_p(\cdot) = \boldsymbol{a}^\intercal\Phi_\boldsymbol{Z}^\intercal\phi(\cdot) \] for some \(a_n\), and where \(\boldsymbol{Z}\) now contains the observed \(\boldsymbol{z}_n\) in its rows. Let \(\boldsymbol{X}= \Phi_\boldsymbol{Z}^\intercal\) denote the \(N \times P\) matrix of regressors \(\boldsymbol{x}_n = \phi(\boldsymbol{z}_n)\). It follows that the vector of predicted values \(\hat{f}(\boldsymbol{z}_n)\) for \(n=1,\ldots,N\) is \(\boldsymbol{X}\boldsymbol{X}^\intercal\boldsymbol{a}\). Since \(P < N\) and \(\boldsymbol{X}\) is full–rank, for any \(\boldsymbol{\beta}\) you can find \(\boldsymbol{a}\) such that \(\boldsymbol{\beta}= \boldsymbol{X}^\intercal\boldsymbol{a}\), so the set of potential solutions evaluated at the data is of the form \(f(\boldsymbol{x}_n) = \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\), and the norm satisfies \(\left\Vert f\right\Vert_\mathcal{H}^2 = \boldsymbol{a}^\intercal\boldsymbol{X}\boldsymbol{X}^\intercal\boldsymbol{a}= \boldsymbol{\beta}^\intercal\boldsymbol{\beta}\).
The result follows by plugging in.
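Concretely, the equivalence can be checked numerically: the primal ridge coefficients \((\boldsymbol{X}^\intercal\boldsymbol{X}+ \boldsymbol{I})^{-1}\boldsymbol{X}^\intercal\boldsymbol{y}\) and the representer-form coefficients \(\boldsymbol{a}= (\boldsymbol{X}\boldsymbol{X}^\intercal+ \boldsymbol{I})^{-1}\boldsymbol{y}\) give the same regression function. The data below are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 4
X = rng.normal(size=(N, P))      # rows are x_n = phi(z_n); full rank with probability one
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X + np.eye(P), X.T @ y)   # primal ridge solution
a_hat = np.linalg.solve(X @ X.T + np.eye(N), y)            # dual / representer form
beta_from_a = X.T @ a_hat                                   # beta = X^T a

assert np.allclose(beta_hat, beta_from_a)
assert np.allclose(X @ beta_hat, X @ beta_from_a)           # identical fitted values
```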
(f)
Before doing ridge regression, we often transform the variables so that \(\sum_{n=1}^N \phi_p(\boldsymbol{z}_n) = 0\) and \(\frac{1}{N} \sum_{n=1}^N \phi_p(\boldsymbol{z}_n)^2 = 1\). This is the same as fixing the diagonals of \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\) to be \(1\), and by the LLN \(\frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\approx \mathrm{Cov}_{\mathcal{Z}}\left(\phi(\boldsymbol{z})\right)\). This sets the scale of the penalization of each parameter to roughly match the scale of penalizing the regression in the covariance space \(\left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}}\). Since \(\mathbb{E}_{\mathcal{Z}}\left[f(\boldsymbol{z}; \boldsymbol{\beta})^2\right]\) is an intuitively meaningful measure of the “size” of \(f\), scaling the variables in this way causes the scale of the ridge penalty to coincide roughly with the scale of the intuitively meaningful covariance inner product \(\left<\cdot, \cdot\right>_{\mathcal{Z}}\).
3 Combining kernels
A key way to produce valid kernels is to combine and modify kernels known to be positive definite.
Suppose that \(\mathcal{K}_{1}\) and \(\mathcal{K}_{2}\) are two symmetric positive definite kernels defined on the same space, \(\mathcal{Z}\times \mathcal{Z}\). Show that the following are also symmetric positive definite kernels.
(a)
\(\mathcal{K}_{C}(\boldsymbol{z}, \boldsymbol{s}) = C \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s})\) for \(C > 0\).
(b)
\(\mathcal{K}_{+}(\boldsymbol{z}, \boldsymbol{s}) = \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s}) + \mathcal{K}_{2}(\boldsymbol{z}, \boldsymbol{s})\).
(c)
\(\mathcal{K}_{\times}(\boldsymbol{z}, \boldsymbol{s}) = \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s}) \times \mathcal{K}_{2}(\boldsymbol{z}, \boldsymbol{s})\).
Hint: You can use the Schur product theorem.
(d)
Suppose that \(\phi: \mathcal{Z}\mapsto \mathcal{X}\) is an invertible map. Then define the kernel on \(\mathcal{X}\times \mathcal{X}\):
\(\mathcal{K}_{\phi}(\boldsymbol{x}, \boldsymbol{y}) = \mathcal{K}_{1}(\phi^{-1}(\boldsymbol{x}), \phi^{-1}(\boldsymbol{y}))\).
In each case, the trick is to show that the gram matrix of the combined kernel inherits positive semi–definiteness from its constituent kernels. For compactness, I’ll just use the symbol \(\boldsymbol{A}_{\cdot}\) to represent a generic gram matrix formed from kernel \(\mathcal{K}_{\cdot}\), with \(\boldsymbol{v}\) a vector of the same dimension. To show that \(\boldsymbol{A}_{\cdot}\) is positive semi–definite, it suffices to show that \(\boldsymbol{v}^\intercal\boldsymbol{A}_{\cdot} \boldsymbol{v}\ge 0\) for all \(\boldsymbol{v}\) and any particular choice of evaluation points forming \(\boldsymbol{A}_{\cdot}\).
(a)
\(\boldsymbol{v}^\intercal\boldsymbol{A}_C \boldsymbol{v}= C (\boldsymbol{v}^\intercal\boldsymbol{A}_1 \boldsymbol{v}) \ge 0.\)
(b)
\(\boldsymbol{v}^\intercal\boldsymbol{A}_+ \boldsymbol{v}= \boldsymbol{v}^\intercal\boldsymbol{A}_1 \boldsymbol{v}+ \boldsymbol{v}^\intercal\boldsymbol{A}_2 \boldsymbol{v}\ge 0\).
(c)
Each entry of \(\boldsymbol{A}_{\times}\) is the product of the corresponding entries of \(\boldsymbol{A}_1\) and \(\boldsymbol{A}_2\). That is, \(\boldsymbol{A}_{\times} = \boldsymbol{A}_1 \odot \boldsymbol{A}_2\), where \(\odot\) is the “Hadamard product.” By the Schur product theorem, the Hadamard product of positive semi–definite matrices is positive semi–definite. (The Wikipedia article proves it for positive definite matrices, but inspection of the proof shows that it holds for positive semi–definite matrices as well.)
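A quick numerical illustration of the Schur product theorem, with randomly generated PSD matrices standing in for the two gram matrices (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
B1, B2 = rng.normal(size=(2, n, n))
A1 = B1 @ B1.T                 # PSD by construction
A2 = B2 @ B2.T
A_times = A1 * A2              # Hadamard (entrywise) product

assert np.min(np.linalg.eigvalsh(A_times)) >= -1e-10   # still PSD
```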
(d)
The entry in the gram matrix \(\boldsymbol{A}_\phi\) evaluated at \(\boldsymbol{x}, \boldsymbol{y}\) is numerically equal to the entry in the gram matrix \(\boldsymbol{A}_1\) evaluated at \(\phi^{-1}(\boldsymbol{x}), \phi^{-1}(\boldsymbol{y})\). Since the \(\boldsymbol{A}_1\) gram matrix is PSD, \(\boldsymbol{A}_\phi\) (the same matrix) is PSD.
4 The ridgeless estimator and minimum norm interpolators
For a function \(f\in \mathcal{H}\) in an RKHS, let \(\hat{\mathscr{L}}(f) = \sum_{n=1}^N(y_n - f(\boldsymbol{x}_n))^2\). Recall the dual form of ridge regression: for any \(\lambda\), there exists a \(B(\lambda)\) (depending on \(\lambda\)) such that the following two problems are equivalent:
\[ \hat{f}_\lambda := \underset{f\in \mathcal{H}}{\mathrm{argmin}}\, \left( \hat{\mathscr{L}}(f) + \lambda \left\Vert f\right\Vert_\mathcal{H}^2 \right) \quad\textrm{and}\quad \underset{f: \left\Vert f\right\Vert_\mathcal{H}^2 \le B(\lambda)}{\mathrm{argmin}}\, \hat{\mathscr{L}}(f). \]
You may assume that the kernel \(\mathcal{K}_{\,}{}\) used to construct the RKHS has an infinite set of linearly independent eigenfunctions and that the gram matrix is full–rank with probability one.
(a)
Argue that, as \(\lambda \rightarrow \infty\), \(B(\lambda) \rightarrow 0\), and that as \(\lambda \rightarrow 0\), \(B(\lambda) \rightarrow \infty\). A simple heuristic argument is enough; you do not need to prove it formally.
(b)
Use the representer theorem to write out an explicit solution for \(\hat{f}_\lambda\).
(c)
Let \(\hat{f}_{0+}\) denote the limit of \(\hat{f}_\lambda\) as \(\lambda \rightarrow 0\). This is sometimes called the “ridgeless estimator.”
In general, does \(\hat{\mathscr{L}}(\hat{f}_\lambda) \rightarrow 0\) as \(\lambda \rightarrow 0\)? Hint: a simple argument uses the dual form of the objective function.
(d)
In general, there are many distinct functions \(f\in \mathcal{H}\) that interpolate the data, i.e., that satisfy \(f(\boldsymbol{x}_n) = y_n\) for all \(n=1, \ldots, N\), each with a different norm \(\left\Vert f\right\Vert_\mathcal{H}\). Among these, which \(f\) is selected by \(\hat{f}_{0+}\)?
(e)
If you had chosen a different kernel, and so a different RKHS, would you have gotten the same answer for \(\hat{f}_{0+}\)?
(a)
As \(\lambda \rightarrow \infty\), the penalty associated with large \(\left\Vert f\right\Vert_\mathcal{H}^2\) increases, and so the norm of the optimal solution goes to zero; this corresponds to \(B(\lambda) \rightarrow 0\). The reverse argument gives the other result.
(b)
We know that any solution takes the form \(f(\cdot) = \sum_{n=1}^Na_n \mathcal{K}_{\,}(\cdot, \boldsymbol{x}_n)\) for some \(\boldsymbol{x}_n\). Letting \(G = \mathcal{K}_{\boldsymbol{X}}\) denote the Gram matrix on the observed data, this means that \(\hat{\mathscr{L}}(f) = (\boldsymbol{Y}- G \boldsymbol{a})^\intercal(\boldsymbol{Y}- G \boldsymbol{a})\) and \(\left\Vert f\right\Vert_\mathcal{H}^2 = \boldsymbol{a}^\intercal G \boldsymbol{a}\). Then the first–order condition with respect to \(\boldsymbol{a}\) (recalling that \(G\) is symmetric) gives
\[ G G \hat{\boldsymbol{a}} - G \boldsymbol{Y}+ \lambda G \hat{\boldsymbol{a}} = \boldsymbol{0}\quad\Rightarrow\quad \hat{\boldsymbol{a}} = \left(G G + \lambda G \right)^{-1} G \boldsymbol{Y}= \left(G + \lambda \boldsymbol{I}_{N} \right)^{-1} \boldsymbol{Y}, \]
where the last equality uses the assumption that \(G\) is full rank.
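The final simplification can be checked numerically; the sketch below uses a simulated full-rank \(G\) and \(\boldsymbol{Y}\) (illustrative values only).

```python
import numpy as np

rng = np.random.default_rng(4)
N, lam = 8, 0.3
B = rng.normal(size=(N, N))
G = B @ B.T + 1e-3 * np.eye(N)     # symmetric, full-rank gram-like matrix
Y = rng.normal(size=N)

a1 = np.linalg.solve(G @ G + lam * G, G @ Y)      # (GG + lambda G)^{-1} G Y
a2 = np.linalg.solve(G + lam * np.eye(N), Y)      # (G + lambda I)^{-1} Y
assert np.allclose(a1, a2)
```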
(c)
Yes, \(\hat{\mathscr{L}}(\hat{f}_\lambda) \rightarrow 0\) as \(\lambda \rightarrow 0\): in the dual form, \(B(\lambda) \rightarrow \infty\), so in the limit we are finding \(\inf_{f\in \mathcal{H}} \sum_{n=1}^N(y_n - f(\boldsymbol{x}_n))^2\) with no norm constraint, and since the gram matrix is full rank, there is an \(f\in \mathcal{H}\) with \(f(\boldsymbol{x}_n) = y_n\) for all \(n\) (take \(\boldsymbol{a}= G^{-1}\boldsymbol{Y}\)).
(d)
The ridgeless estimator \(\hat{f}_{0+}\) selects the interpolator with the smallest norm \(\left\Vert f\right\Vert_\mathcal{H}\) — the empirical loss among the interpolators is the same, but for \(\lambda\) very small but nonzero the norm is still penalized.
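The sketch below illustrates this numerically with an arbitrary RBF kernel and simulated data (illustrative choices only): as \(\lambda \rightarrow 0\), the empirical loss of \(\hat{f}_\lambda\) vanishes and its norm approaches that of the interpolator with \(\boldsymbol{a}= G^{-1}\boldsymbol{Y}\).

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 8)
y = np.sin(3 * x) + 0.1 * rng.normal(size=8)
G = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2)   # RBF gram matrix

a_interp = np.linalg.solve(G, y)                  # interpolating (ridgeless) coefficients
print("interpolator norm^2:", a_interp @ G @ a_interp)
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    a_lam = np.linalg.solve(G + lam * np.eye(len(x)), y)
    loss = np.sum((y - G @ a_lam) ** 2)           # empirical loss of f_lambda
    norm_sq = a_lam @ G @ a_lam                   # ||f_lambda||_H^2
    print(f"lambda={lam:.0e}  loss={loss:.2e}  norm^2={norm_sq:.4f}")
```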
(e)
In general you would not, by (d): the RKHS norm would be different, and the RKHS norm effectively determines which interpolator you choose.
5 Support vector machines as empirical risk minimization
(a) (Hastie, Tibshirani, and Friedman (2009) 12.1)
Show that the following two problems are equivalent ways of representing a support vector classifier with slack variables:
\[ \underset{\beta_0, \boldsymbol{\beta}}{\mathrm{argmin}}\, \left( \sum_{n=1}^N(1 - y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0))_{+} + \frac{\lambda}{2} \left\Vert\boldsymbol{\beta}\right\Vert_2^2 \right) \]
and
\[ \begin{aligned} &\underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \left\Vert\boldsymbol{\beta}\right\Vert_2^2 + C \sum_{n=1}^N\xi_n \textrm{ subject to}\\ &\xi_n \ge 0 \textrm{ and }y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) \ge 1 - \xi_n \textrm{ for all }n, \end{aligned} \]
for some pair \(\lambda > 0\) and \(C > 0\).
Setting \(\xi_n' = (1 - y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0))_+\), the constraints on the latter problem are equivalent to \(\xi_n \ge \max\{0, 1 - y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0)\} = \xi_n'\). Since \(C > 0\), the minimum with respect to \(\xi_n\) is always achieved at the constraint lower bound \(\xi_n = \xi_n'\), giving that the latter problem is equivalent to \(\underset{\boldsymbol{\beta}, \beta_0}{\mathrm{argmin}}\, \left\Vert\boldsymbol{\beta}\right\Vert_2^2 + C \sum_{n=1}^N\xi_n',\) viewing \(\xi'_n\) as a function of \((\boldsymbol{\beta}, \beta_0)\). Since the argmin is unaffected by multiplication by positive scalars, multiplying this objective by \(\frac{1}{C}\) gives \(\sum_{n=1}^N\xi_n' + \frac{1}{C}\left\Vert\boldsymbol{\beta}\right\Vert_2^2\), which is the first problem with \(\lambda = 2 / C\); hence the two problems are equivalent.
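A small numerical illustration of the correspondence \(\lambda = 2/C\) (the data and the candidate \((\boldsymbol{\beta}, \beta_0)\) below are simulated and arbitrary): for any fixed \((\boldsymbol{\beta}, \beta_0)\), the slack-form objective evaluated at the optimal slacks \(\xi_n = \xi_n'\) is exactly \(\frac{2}{\lambda}\) times the hinge-loss objective, so the two problems share the same minimizers.

```python
import numpy as np

rng = np.random.default_rng(6)
N, P, lam = 30, 3, 0.7
X = rng.normal(size=(N, P))
y = rng.choice([-1.0, 1.0], size=N)
beta, beta0 = rng.normal(size=P), 0.2

margins = y * (X @ beta + beta0)
xi = np.maximum(0.0, 1.0 - margins)                    # optimal slack variables xi_n'

hinge_obj = np.sum(xi) + 0.5 * lam * np.sum(beta**2)   # first formulation
C = 2.0 / lam
slack_obj = np.sum(beta**2) + C * np.sum(xi)           # second formulation at optimal slacks

assert np.isclose(slack_obj, (2.0 / lam) * hinge_obj)
```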
6 Fourier series and Bochner’s theorem (254 only \(\star\) \(\star\) \(\star\))
Bochner’s theorem states that all positive definite stationary kernels have a pointwise positive Fourier transform. Here, we’ll explore related concepts on a finite domain.
(a)
Suppose we have a class of functions defined by
\[ \mathcal{F}= \{ f: f(z) = \sum_{k=1}^K \beta_k( \phi_{ck}(z) + \phi_{sk}(z) ) \} \]
where \(\phi_{ck}(z) = \cos(\omega_k z)\) and \(\phi_{sk}(z) = \sin(\omega_k z)\). Using these features, define a kernel
\[ \mathcal{K}_{\phi}(z, z') = \sum_{k=1}^K \lambda_k (\phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z')). \]
Show that \(\mathcal{K}_{\phi}(z, z')\) is a stationary kernel. This shows that Fourier features give stationary kernels.
(b)
Now, suppose we have a stationary, symmetric, positive definite kernel \(\mathcal{K}_{\Phi}(z, z') = \Phi(z- z')\) for scalar \(z\). Assume that the domain of \(z\) is bounded so that \(\Delta := z- z' \in [-1,1]\). Assume that the kernel can be expanded in a Fourier series:
\[ \Phi(\Delta) = \sum_{k=1}^\infty \left( a_k \cos\left( 2 \pi k \Delta\right) + b_k \sin\left( 2 \pi k \Delta\right) \right) \textrm{ for some coefficients } a_k, b_k. \]
(A function has a Fourier series under quite mild regularity conditions.) Show that \(b_k = 0\) for all \(k\).
(c)
Next, show that
\[ \Phi(\Delta) = \sum_{k=1}^\infty \lambda_k (\phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z')), \]
and identify \(\lambda_k\) and \(\omega_k\) in terms of quantities given in (b). We conclude that stationary kernels have Fourier features.
(d)
Suppose that \(a_k < 0\) for some \(k\). If this is the case, show that \(\mathcal{K}_{\Phi}\) cannot be positive definite. Hint: \(\int_{-1}^1 \cos(2 \pi k z) \cos(2 \pi k' z) dz= 0\) for \(k \ne k'\) — that is, the Fourier basis functions are orthogonal, and you can select test points and vectors to approximate integrals using the Gram matrix.
From this it follows that bounded positive stationary kernels have pointwise positive Fourier series coefficients. Bochner’s theorem states a more general result in terms of kernels on unbounded domains: under some regularity conditions, every positive definite symmetric stationary kernel has a pointwise positive Fourier transform.
(a)
Using the trigonometric identity \(\cos(x+ y) = \cos x\cos y- \sin x\sin y\), we can write
\[ \begin{aligned} \phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z') ={}& \cos(\omega_k z) \cos(\omega_k z') + \sin(\omega_k z) \sin(\omega_k z') \\={}& \cos(\omega_k z) \cos(-\omega_k z') - \sin(\omega_k z) \sin(-\omega_k z') \\={}& \cos(\omega_k (z- z')). \end{aligned} \]
Then \(\mathcal{K}_{\phi}(z, z') = \sum_{k=1}^K \lambda_k \cos(\omega_k (z- z'))\), which depends only on \(z- z'\), and so is stationary.
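As a quick numerical check (with arbitrary illustrative frequencies \(\omega_k\) and weights \(\lambda_k\)), the kernel indeed depends only on \(z - z'\):

```python
import numpy as np

omega = np.array([1.0, 2.5, 4.0])    # illustrative frequencies omega_k
lam = np.array([0.5, 1.0, 0.25])     # illustrative weights lambda_k

def K_phi(z, zp):
    return np.sum(lam * (np.cos(omega * z) * np.cos(omega * zp)
                         + np.sin(omega * z) * np.sin(omega * zp)))

def K_stationary(delta):
    return np.sum(lam * np.cos(omega * delta))

for z, zp in [(0.3, -0.2), (1.7, 1.2), (-2.0, -2.5)]:   # every pair has z - z' = 0.5
    assert np.isclose(K_phi(z, zp), K_stationary(0.5))
```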
(b)
Since the kernel is symmetric, \(\Phi(\Delta) = \mathcal{K}_{\Phi}(z, z') = \mathcal{K}_{\Phi}(z', z) = \Phi(-\Delta)\), so \(0 = \Phi(\Delta) - \Phi(-\Delta)= 2 \sum_{k=1}^\infty b_k \sin\left( 2 \pi k \Delta \right)\), which can only be true for all \(\Delta\) if \(b_k = 0\) for all \(k\).
(c)
Again using the cosine addition formula as in (a), \(\Phi(z- z') = \sum_{k=1}^\infty a_k \cos(2 \pi k (z- z')) = \sum_{k=1}^\infty a_k \left(\phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z')\right)\), so \(\lambda_k = a_k\) and \(\omega_k = 2\pi k\).
(d)
Fix a grid of points \(z_n\) in \((-1,1)\) that are \(\varepsilon\) apart, and set \(c_n = \cos(2 \pi k z_n)\), where \(k\) is the index with \(a_k < 0\). Let \(\boldsymbol{c}\) denote the vector of \(c_n\) values, and let \(\boldsymbol{A}\) denote the gram matrix at these points, with entries \(\boldsymbol{A}_{mn} = \Phi(z_m - z_n)\). Using \(b_j = 0\), the cosine addition formula, and the orthogonality of the Fourier basis, the \(m\)–th entry of \(\boldsymbol{A}\boldsymbol{c}\) satisfies
\[ \begin{aligned} \varepsilon (\boldsymbol{A}\boldsymbol{c})_m ={}& \varepsilon \sum_{n} \sum_{j=1}^\infty a_j \left( \cos(2 \pi j z_m)\cos(2 \pi j z_n) + \sin(2 \pi j z_m)\sin(2 \pi j z_n) \right) \cos(2 \pi k z_n) \\ \rightarrow{}& \sum_{j=1}^\infty a_j \left( \cos(2 \pi j z_m) \int_{-1}^1 \cos(2 \pi j z)\cos(2 \pi k z) dz+ \sin(2 \pi j z_m) \int_{-1}^1 \sin(2 \pi j z)\cos(2 \pi k z) dz\right) \\={}& a_k \cos(2 \pi k z_m) = a_k c_m \quad \textrm{ as }\varepsilon \rightarrow 0, \end{aligned} \]
since \(\int_{-1}^1 \cos(2 \pi j z)\cos(2 \pi k z) dz= \mathrm{I}\left(j = k\right)\) and \(\int_{-1}^1 \sin(2 \pi j z)\cos(2 \pi k z) dz= 0\).
Therefore \(\varepsilon \boldsymbol{c}^\intercal\boldsymbol{A}\boldsymbol{c}\approx a_k \boldsymbol{c}^\intercal\boldsymbol{c}\), and since \(a_k < 0\) and \(\boldsymbol{c}^\intercal\boldsymbol{c}> 0\), we have \(\boldsymbol{c}^\intercal\boldsymbol{A}\boldsymbol{c}< 0\) for sufficiently small \(\varepsilon\), which shows that \(\mathcal{K}_{\Phi}\) cannot be positive definite.
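As a numerical companion to this argument, the sketch below builds \(\Phi\) with an illustrative negative coefficient \(a_2 = -0.5\), forms the gram matrix on a fine grid, and confirms that \(\boldsymbol{c}^\intercal\boldsymbol{A}\boldsymbol{c}< 0\).

```python
import numpy as np

a = {1: 1.0, 2: -0.5}              # Fourier coefficients with a_2 < 0 (illustrative)

def Phi(delta):
    return sum(a_k * np.cos(2 * np.pi * k * delta) for k, a_k in a.items())

z = np.linspace(-1, 1, 201)
A = Phi(z[:, None] - z[None, :])   # gram matrix at the grid points
c = np.cos(2 * np.pi * 2 * z)      # test vector matched to the negative coefficient

assert c @ A @ c < 0               # so K_Phi is not positive definite
```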
7 Positive definiteness of the RBF kernel (254 only \(\star\) \(\star\) \(\star\))
Let \(\Sigma\) denote a \(D \times D\) positive definite matrix. Show that the general RBF kernel
\[ \mathcal{K}_{RBF}(\boldsymbol{x}, \boldsymbol{y}) = \exp\left(- (\boldsymbol{x}- \boldsymbol{y})^\intercal\Sigma (\boldsymbol{x}- \boldsymbol{y}) \right) \]
is positive definite.
Now that I write out the solutions, I apologize that my hint was actually not that helpful. By 3(d), the problem is equivalent to asking whether the kernel \(\mathcal{K}_{\,}(\boldsymbol{x}, \boldsymbol{y}) = \exp(- (\boldsymbol{x}- \boldsymbol{y})^\intercal(\boldsymbol{x}- \boldsymbol{y}))\) is positive definite, using the invertible transformation \(\phi(\boldsymbol{x}) = \Sigma^{-1/2} \boldsymbol{x}\), so that \(\phi^{-1}(\boldsymbol{x}) = \Sigma^{1/2}\boldsymbol{x}\) and \(\mathcal{K}_{\,}(\phi^{-1}(\boldsymbol{x}), \phi^{-1}(\boldsymbol{y})) = \exp(- (\boldsymbol{x}- \boldsymbol{y})^\intercal\Sigma (\boldsymbol{x}- \boldsymbol{y}))\). Then
\[ \exp(- (\boldsymbol{x}- \boldsymbol{y})^\intercal(\boldsymbol{x}- \boldsymbol{y})) = \prod_{p=1}^D \exp(-(x_p - y_p)^2), \]
which is a product kernel and so positive definite via 3(c) and our univariate result from class. Going via my hint appears harder due to not being able to take advantage of the Schur product theorem in 3(c).
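As a final numerical sanity check (with a randomly generated positive definite \(\Sigma\) and random points, purely illustrative), the general RBF gram matrix matches the isotropic kernel applied to the transformed points \(\Sigma^{1/2}\boldsymbol{x}\), and it has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)
D, N = 3, 12
B = rng.normal(size=(D, D))
Sigma = B @ B.T + np.eye(D)                         # positive definite
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T          # Sigma^{1/2}

X = rng.normal(size=(N, D))
diffs = X[:, None, :] - X[None, :, :]               # N x N x D array of x_i - x_j
K_rbf = np.exp(-np.einsum('ijd,de,ije->ij', diffs, Sigma, diffs))

Xw = X @ Sigma_half                                 # transformed points Sigma^{1/2} x
dw = Xw[:, None, :] - Xw[None, :, :]
K_iso = np.exp(-np.sum(dw**2, axis=-1))             # isotropic kernel on transformed points

assert np.allclose(K_rbf, K_iso)
assert np.min(np.linalg.eigvalsh(K_rbf)) >= -1e-10  # positive semi-definite gram matrix
```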