Homework 5 (due May 3rd 9pm)

Stat 154/254: Statistical Machine Learning

1 Inner products

Recall that an inner product on a vector space must satisfy symmetry, linearity, and positive definiteness. Show that each of the following inner products satisfies these properties.

(In each problem, you may assume without proof that the space is a valid vector space — you need only verify the properties of the inner product.)

(a) Dot product

Let \(\boldsymbol{x}, \boldsymbol{s}\in \mathbb{R}^{D}\), and define \(\left<\boldsymbol{x}, \boldsymbol{s}\right> := \boldsymbol{x}^\intercal\boldsymbol{s}\).

(b) Positive definite matrices

Let \(\boldsymbol{x}, \boldsymbol{s}\in \mathbb{R}^{D}\), let \(\boldsymbol{A}\) denote a symmetric positive definite matrix, and define \(\left<\boldsymbol{x}, \boldsymbol{s}\right>_\boldsymbol{A}:= \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{s}\).
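
If helpful, here is a small numerical sanity check (not a substitute for the proof) of the symmetry and positivity of \(\left<\cdot,\cdot\right>_\boldsymbol{A}\), using a randomly generated symmetric positive definite \(\boldsymbol{A}\); the dot product of (a) is the special case \(\boldsymbol{A} = \boldsymbol{I}\).

```python
# A quick numerical sanity check (not a proof) of the properties of <x, s>_A,
# using a randomly generated symmetric positive definite matrix A.
import numpy as np

rng = np.random.default_rng(0)
D = 5
B = rng.normal(size=(D, D))
A = B @ B.T + D * np.eye(D)        # symmetric positive definite by construction

x, s = rng.normal(size=D), rng.normal(size=D)

# Symmetry: <x, s>_A == <s, x>_A because A = A^T.
assert np.isclose(x @ A @ s, s @ A @ x)

# Positive definiteness: <x, x>_A > 0 for nonzero x.
assert x @ A @ x > 0

# The dot product in (a) is the special case A = I.
assert np.isclose(x @ np.eye(D) @ s, x @ s)
```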

(c) Infinite series

Let \(\mathcal{X}= \{ \boldsymbol{x}= (x_1, x_2, \ldots) \textrm{ such that } \sum_{n=1}^\infty x_n^2 < \infty \}\), and let \(\left<\boldsymbol{x}, \boldsymbol{s}\right> = \sum_{n=1}^\infty a_n x_n s_n\) for bounded constants \(a_n > 0\).

(d) Continuous functions

Let \(\mathcal{F}= \{ f: \mathcal{X}\mapsto \mathbb{R} \textrm{ such that } f\textrm{ is continuous and } \int f(\boldsymbol{x})^2 \mu(\boldsymbol{x}) d\boldsymbol{x}< \infty \}\) for some strictly positive weight function \(\mu(\boldsymbol{x})\), and define \(\left<f, g\right> := \int f(\boldsymbol{x}) g(\boldsymbol{x}) \mu(\boldsymbol{x}) d\boldsymbol{x}\).

(e) Covariances

Let \(\mathcal{F}= \{ f: \mathcal{X}\mapsto \mathbb{R} \textrm{ such that } f\textrm{ is continuous and } \mathbb{E}\left[f(\boldsymbol{x})^2\right] < \infty \}\) where \(\boldsymbol{x}\) is a random variable, and \(\left<f, g\right>_\mathcal{X}= \mathbb{E}_{\boldsymbol{x}}\left[f(\boldsymbol{x}) g(\boldsymbol{x})\right]\). You may assume that every open set in the domain of \(\boldsymbol{x}\) has positive probability.

(f) RKHS inner products

Let \(\mathcal{K}(\cdot, \cdot)\) denote a symmetric, positive definite kernel, let \(\mathcal{H}= \{ f: \mathcal{X}\mapsto\mathbb{R} \textrm{ such that } f(\boldsymbol{x}) = \sum_{k=1}^K a_k \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}_k) \textrm{ for some }\boldsymbol{a},\boldsymbol{x}_1,\ldots,\boldsymbol{x}_K\}\), and let \(\left<\cdot,\cdot\right>_\mathcal{H}\) denote the usual RKHS inner product.
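
For reference, if \(f(\boldsymbol{x}) = \sum_{k=1}^K a_k \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}_k)\) and \(g(\boldsymbol{x}) = \sum_{j=1}^J b_j \mathcal{K}(\boldsymbol{x}, \boldsymbol{s}_j)\), the usual RKHS inner product of such finite expansions is

\[ \left<f, g\right>_\mathcal{H} = \sum_{k=1}^K \sum_{j=1}^J a_k b_j \mathcal{K}(\boldsymbol{x}_k, \boldsymbol{s}_j). \]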

2 Regression as an RKHS

Suppose we observe IID pairs \((\boldsymbol{z}_n, y_n)\) for \(n = 1, \ldots, N\), and have defined a feature map \(\phi:\mathcal{Z}\mapsto \mathbb{R}^{P}\) for regression. Let

\[ \mathcal{F}=\{f:\mathcal{Z}\mapsto \mathbb{R}^{} \textrm{ such that }f(\boldsymbol{z}) = \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta} \textrm{ for some } \boldsymbol{\beta}\in \mathbb{R}^{P}\} \]

denote the class of regression functions. We will express this class of functions as an RKHS using the kernel \(\mathcal{K}(\boldsymbol{z}, \boldsymbol{z}') = \phi(\boldsymbol{z})^\intercal\phi(\boldsymbol{z}')\).

You may assume that the second moments of \(\phi(\boldsymbol{z})\) are finite, i.e. that \(\mathbb{E}_{\mathcal{Z}}\left[\phi(\boldsymbol{z}) \phi(\boldsymbol{z})^\intercal\right] < \infty\).
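
As a concrete illustration of this construction (not part of the problem), take \(\mathcal{Z} = \mathbb{R}\) and \(\phi(z) = (1, z)^\intercal\); then \(\mathcal{F}\) is the set of affine functions of \(z\), and the associated kernel is

\[ \mathcal{K}(z, z') = \phi(z)^\intercal\phi(z') = 1 + z z'. \]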

(a)

Suppose that the final component of \(\phi(\boldsymbol{z})\), \(\phi_P(\boldsymbol{z})\), can be written as a linear combination of the others. In other words, suppose there exist \(a_1, \ldots, a_{P-1}\) such that \(\phi_P(\boldsymbol{z}) = a_1 \phi_1(\boldsymbol{z}) + \ldots + a_{P-1}\phi_{P-1}(\boldsymbol{z})\) for all \(\boldsymbol{z}\). Define a class \(\mathcal{F}'\) containing the same functions as \(\mathcal{F}\) but using only the first \(P-1\) components of \(\phi(\boldsymbol{z})\).

Given this result, we will assume from now on that \(\mathcal{F}\) is written using the minimal dimension \(P\), so that no component of \(\phi\) can be written as a linear combination of the other components for all \(\boldsymbol{z}\).

(b)

Let

\[ \mathcal{H}= \{ f:\mathcal{Z}\mapsto \mathbb{R} \textrm{ such that } f(\boldsymbol{z}) = \sum_{k=1}^K a_k \mathcal{K}(\boldsymbol{z}, \boldsymbol{z}_k) \textrm{ for some }\boldsymbol{a}, \boldsymbol{z}_1,\ldots,\boldsymbol{z}_K \}, \]

and let \(\left<\cdot,\cdot\right>_\mathcal{H}\) denote the usual RKHS inner product. Show that \(f\in \mathcal{H}\) if and only if \(f\in \mathcal{F}\).

(c)

For \(\boldsymbol{\beta}\in \mathbb{R}^{P}\), write \(f(\cdot; \boldsymbol{\beta}) \in \mathcal{F}\) for the function \(f(\boldsymbol{z}; \boldsymbol{\beta}) = \phi(\boldsymbol{z})^\intercal\boldsymbol{\beta}\). Show that \(\left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_\mathcal{H}= \boldsymbol{\beta}^\intercal\boldsymbol{\beta}'\).

(d)

Define the expectation inner product

\[ \left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}} = \mathbb{E}_{\mathcal{Z}}\left[f(\boldsymbol{z}; \boldsymbol{\beta}) f(\boldsymbol{z}; \boldsymbol{\beta}')\right]. \]

Under what conditions are the two inner products equal? That is, when does \(\left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_{\mathcal{Z}} = \left<f(\boldsymbol{z}; \boldsymbol{\beta}), f(\boldsymbol{z}; \boldsymbol{\beta}')\right>_\mathcal{H}\) for all \(f\)?

(e)

Show that the solutions to the following two problems give the same regression function:

\[ \underset{\boldsymbol{\beta}\in \mathbb{R}^{P}}{\mathrm{argmin}}\, \left(\sum_{n=1}^N(y_n - \boldsymbol{\beta}^\intercal\phi(\boldsymbol{z}_n))^2 + \left\Vert\boldsymbol{\beta}\right\Vert_2^2 \right) \quad\textrm{and}\quad \underset{f\in \mathcal{H}}{\mathrm{argmin}}\, \left(\sum_{n=1}^N(y_n - f(\boldsymbol{z}_n))^2 + \left\Vert f\right\Vert_{\mathcal{H}}^2 \right). \]
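
Before writing the proof, it may help to see the claim numerically. The sketch below uses a hypothetical polynomial feature map and synthetic data, fits both problems in closed form, and checks that the two fitted functions agree at new inputs.

```python
# A numerical illustration (not a proof) that the two problems above produce
# the same fitted regression function.  The feature map phi and the data are
# made-up choices used only for this check.
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 3
z = rng.normal(size=N)
y = rng.normal(size=N)

def phi(z):
    """Hypothetical polynomial feature map, phi(z) = (1, z, z^2)."""
    return np.stack([np.ones_like(z), z, z**2], axis=-1)

Phi = phi(z)          # N x P design matrix
K = Phi @ Phi.T       # Gram matrix, K(z_n, z_m) = phi(z_n)^T phi(z_m)

# Feature-space ridge solution: beta_hat = (Phi^T Phi + I)^{-1} Phi^T y.
beta_hat = np.linalg.solve(Phi.T @ Phi + np.eye(P), Phi.T @ y)

# RKHS solution via the representer theorem: alpha = (K + I)^{-1} y.
alpha = np.linalg.solve(K + np.eye(N), y)

# The two fitted functions agree at new inputs.
z_new = rng.normal(size=5)
pred_primal = phi(z_new) @ beta_hat
pred_kernel = phi(z_new) @ Phi.T @ alpha   # k(z_new)^T alpha
assert np.allclose(pred_primal, pred_kernel)
```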

(f)

In light of (d) and (e), comment on an RKHS interpretation of the practice of standardizing regressors before performing ridge regression.

3 Combining kernels

A key way to produce valid kernels is to combine and modify kernels known to be positive definite.

Suppose that \(\mathcal{K}_{1}\) and \(\mathcal{K}_{2}\) are two symmetric positive definite kernels defined on the same space, \(\mathcal{Z}\times \mathcal{Z}\). Show that the following are also symmetric positive definite kernels.

(a)

\(\mathcal{K}_{C}(\boldsymbol{z}, \boldsymbol{s}) = C \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s})\) for \(C > 0\).

(b)

\(\mathcal{K}_{+}(\boldsymbol{z}, \boldsymbol{s}) = \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s}) + \mathcal{K}_{2}(\boldsymbol{z}, \boldsymbol{s})\).

(c)

\(\mathcal{K}_{\times}(\boldsymbol{z}, \boldsymbol{s}) = \mathcal{K}_{1}(\boldsymbol{z}, \boldsymbol{s}) \times \mathcal{K}_{2}(\boldsymbol{z}, \boldsymbol{s})\).

Hint: You can use the Schur product theorem.
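
For intuition, the following sketch checks numerically that the elementwise (Hadamard) product of two PSD Gram matrices has no negative eigenvalues, as the Schur product theorem guarantees. The RBF and linear kernels below are stand-in choices of \(\mathcal{K}_{1}\) and \(\mathcal{K}_{2}\).

```python
# A numerical illustration of the Schur product theorem from the hint: the
# elementwise product of two PSD Gram matrices is again PSD.
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 2))

sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K1 = np.exp(-sq_dists)     # Gram matrix of an RBF kernel
K2 = Z @ Z.T               # Gram matrix of the linear kernel

K_times = K1 * K2          # elementwise product: the Gram matrix of K_x
print(np.linalg.eigvalsh(K_times).min())   # nonnegative up to roundoff
```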

(d)

Suppose that \(\phi: \mathcal{Z}\mapsto \mathcal{X}\) is an invertible map. Then define the kernel on \(\mathcal{X}\times \mathcal{X}\):

\(\mathcal{K}_{\phi}(\boldsymbol{x}, \boldsymbol{y}) = \mathcal{K}_{1}(\phi^{-1}(\boldsymbol{x}), \phi^{-1}(\boldsymbol{y}))\).

4 The ridgeless estimator and minimum norm interpolators

For a function \(f\in \mathcal{H}\) in an RKHS, let \(\hat{\mathscr{L}}(f) = \sum_{n=1}^N(y_n - f(\boldsymbol{x}_n))^2\). Recall the dual form of ridge regression: for any \(\lambda > 0\), there exists a \(B(\lambda)\) (depending on \(\lambda\)) such that the following two problems are equivalent:

\[ \hat{f}_\lambda := \underset{f\in \mathcal{H}}{\mathrm{argmin}}\, \left( \hat{\mathscr{L}}(f) + \lambda \left\Vert f\right\Vert_\mathcal{H}^2 \right) \quad\textrm{and}\quad \underset{f: \left\Vert f\right\Vert_\mathcal{H}^2 \le B(\lambda)}{\mathrm{argmin}}\, \hat{\mathscr{L}}(f). \]

You may assume that the kernel \(\mathcal{K}\) used to construct the RKHS has an infinite set of linearly independent eigenfunctions.
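
As an informal companion to parts (a)–(c), the sketch below fits kernel ridge regression on synthetic data for several values of \(\lambda\), using an RBF kernel and the standard closed-form kernel ridge coefficients (both assumptions made only for illustration), and prints the training loss alongside the RKHS norm of the fit.

```python
# An informal numerical companion to parts (a)-(c): fit kernel ridge regression
# for a range of lambda values and track the training loss and the RKHS norm
# of the fit.  The RBF kernel and synthetic data are made-up choices.
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = rng.uniform(-1, 1, size=N)
y = np.sin(3 * x) + 0.1 * rng.normal(size=N)

K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # RBF Gram matrix

for lam in [10.0, 1.0, 0.1, 1e-3, 1e-6]:
    # Standard kernel ridge coefficients, alpha = (K + lambda I)^{-1} y,
    # so that f_hat(x_n) = (K alpha)_n and ||f_hat||_H^2 = alpha^T K alpha.
    alpha = np.linalg.solve(K + lam * np.eye(N), y)
    fitted = K @ alpha
    loss = np.sum((y - fitted) ** 2)
    rkhs_norm_sq = alpha @ K @ alpha
    print(f"lambda={lam:8.1e}  loss={loss:10.6f}  ||f||_H^2={rkhs_norm_sq:10.4f}")
```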

(a)

Argue that, as \(\lambda \rightarrow \infty\), \(B(\lambda) \rightarrow 0\), and that as \(\lambda \rightarrow 0\), \(B(\lambda) \rightarrow \infty\). A simple heuristic argument is enough; you do not need to prove it formally.

(b)

Use the representer theorem to write out an explicit solution for \(\hat{f}_\lambda\).

(c)

Let \(\hat{f}_{0+}\) denote the limit of \(\hat{f}_\lambda\) as \(\lambda \rightarrow 0\). This is sometimes called the “ridgeless estimator.”

In general, does \(\hat{\mathscr{L}}(\hat{f}_\lambda) \rightarrow 0\) as \(\lambda \rightarrow 0\)? Hint: a simple argument uses the dual form of the objective function.

(d)

In general, there are many distinct functions \(f\in \mathcal{H}\) that interpolate the data, i.e., that satisfy \(f(\boldsymbol{x}_n) = y_n\) for all \(n=1, \ldots, N\), each with a different norm \(\left\Vert f\right\Vert_\mathcal{H}\). Among these, which \(f\) is selected by \(\hat{f}_{0+}\)?

(e)

If you had chosen a different kernel, and so a different RKHS, would you have gotten the same answer for \(\hat{f}_{0+}\)?

5 Support vector machines as empirical risk minimization

(a) (Hastie, Tibshirani, and Friedman (2009), Exercise 12.1)

Show that the following two problems are equivalent ways of representing a support vector classifier with slack variables:

\[ \underset{\beta_0, \boldsymbol{\beta}}{\mathrm{argmin}}\, \left( \sum_{n=1}^N(1 - y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0))_{+} + \frac{\lambda}{2} \left\Vert\boldsymbol{\beta}\right\Vert_2^2 \right) \]

and

\[ \begin{aligned} &\underset{\beta_0, \boldsymbol{\beta}, \boldsymbol{\xi}}{\mathrm{argmin}}\, \left\Vert\boldsymbol{\beta}\right\Vert_2^2 + C \sum_{n=1}^N\xi_n \textrm{ subject to}\\ &\xi_n \ge 0 \textrm{ and }y_n (\boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \beta_0) \ge 1 - \xi_n \textrm{ for all }n. \end{aligned} \]
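
One way to get a feel for the equivalence (not a proof) is to see how the slack variables relate to the hinge loss. In the sketch below, for a made-up data set and an arbitrary \((\beta_0, \boldsymbol{\beta})\), the smallest feasible slacks in the second problem are computed and shown to equal the hinge terms in the first.

```python
# A small numerical check of the key identity behind the equivalence: for any
# fixed (beta_0, beta), the smallest feasible slack in the constrained problem
# is xi_n = max(0, 1 - y_n (beta^T x_n + beta_0)), i.e. exactly the hinge loss.
# The data and coefficients below are arbitrary stand-ins.
import numpy as np

rng = np.random.default_rng(0)
N, D = 10, 3
X = rng.normal(size=(N, D))
y = rng.choice([-1.0, 1.0], size=N)
beta, beta_0 = rng.normal(size=D), 0.5

margins = y * (X @ beta + beta_0)
xi = np.maximum(0.0, 1.0 - margins)       # minimal feasible slacks

# Feasibility: xi_n >= 0 and y_n (beta^T x_n + beta_0) >= 1 - xi_n for all n.
assert np.all(xi >= 0) and np.all(margins >= 1 - xi - 1e-12)

# The hinge-loss term in the first objective equals the sum of these slacks.
hinge = np.sum(np.maximum(0.0, 1.0 - margins))
assert np.isclose(hinge, xi.sum())
```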

6 Fourier series and Bochner’s theorem (254 only \(\star\) \(\star\) \(\star\))

Bochner’s theorem states that every positive definite stationary kernel has a pointwise nonnegative Fourier transform. Here, we’ll explore related concepts on a bounded domain.

(a)

Suppose we have a class of functions defined by

\[ \mathcal{F}= \{ f: f(z) = \sum_{k=1}^K \beta_k( \phi_{ck}(z) + \phi_{sk}(z) ) \} \]

where \(\phi_{ck}(z) = \cos(\omega_k z)\) and \(\phi_{sk}(z) = \sin(\omega_k z)\). Using these features, define a kernel

\[ \mathcal{K}_{\phi}(z, z') = \sum_{k=1}^K \lambda_k (\phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z')) \quad\textrm{for }\lambda_k \ge 0. \]

Show that \(\mathcal{K}_{\phi}(z, z')\) is a stationary kernel. This shows that Fourier features give stationary kernels.
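
A quick numerical check of this claim (not a substitute for the derivation), using made-up frequencies \(\omega_k\) and weights \(\lambda_k\): shifting both arguments by the same amount leaves \(\mathcal{K}_{\phi}\) unchanged.

```python
# A numerical check (not a proof) that K_phi(z, z') depends on z and z' only
# through z - z'.  The frequencies and weights are arbitrary stand-ins.
import numpy as np

rng = np.random.default_rng(0)
omega = np.array([1.0, 2.5, 4.0])      # omega_k
lam = np.array([0.7, 0.2, 0.1])        # lambda_k >= 0

def K_phi(z, zp):
    return np.sum(lam * (np.cos(omega * z) * np.cos(omega * zp)
                         + np.sin(omega * z) * np.sin(omega * zp)))

# Shifting both arguments by the same amount leaves the kernel unchanged.
z, zp = 0.3, -1.2
for shift in rng.normal(size=5):
    assert np.isclose(K_phi(z, zp), K_phi(z + shift, zp + shift))
```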

(b)

Now, suppose we have a stationary, symmetric, positive definite kernel \(\mathcal{K}_{\Phi}(z, z') = \Phi(z - z')\) for scalar \(z\). Assume that the domain of \(z\) is bounded so that \(\Delta := z - z' \in [-1,1]\). Assume that the kernel can be expanded in a Fourier series:

\[ \Phi(\Delta) = \sum_{k=1}^\infty \left( a_k \cos\left( 2 \pi k \Delta\right) + b_k \sin\left( 2 \pi k \Delta\right) \right). \]

(A function has a Fourier series under quite mild regularity conditions.) Show that \(b_k = 0\) for all \(k\).

(c)

Next, show that

\[ \Phi(\Delta) = \sum_{k=1}^\infty \lambda_k (\phi_{ck}(z) \phi_{ck}(z') + \phi_{sk}(z) \phi_{sk}(z')), \]

and identify \(\lambda_k\) and \(\omega_k\) in terms of quantities given in (b). We conclude that stationary kernels have Fourier features.

(d)

Suppose that \(a_k < 0\) for some \(k\). If this is the case, show that \(\mathcal{K}_{\Phi}\) cannot be positive definite. Hint: \(\int_{-1}^1 \cos(2 \pi k z) \cos(2 \pi k' z) dz= 0\) for \(k \ne k'\) — that is, the Fourier basis functions are orthogonal, and you can select test points and vectors to approximate integrals using the Gram matrix.
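
The following sketch illustrates the claim numerically with made-up coefficients: taking \(a_1 = 1\) and \(a_2 = -0.5\) (and all other coefficients zero), the Gram matrix of \(\mathcal{K}_{\Phi}\) at a grid of test points has a strictly negative eigenvalue.

```python
# A numerical illustration of part (d): with one negative Fourier coefficient,
# the Gram matrix of K_Phi at a grid of test points has a negative eigenvalue,
# so K_Phi cannot be positive definite.  The coefficients are stand-ins.
import numpy as np

def Phi(delta):
    # a_1 = 1 > 0, but a_2 = -0.5 < 0; all other coefficients are zero.
    return np.cos(2 * np.pi * 1 * delta) - 0.5 * np.cos(2 * np.pi * 2 * delta)

z = np.linspace(0.0, 1.0, 40, endpoint=False)   # test points, so Delta in [-1, 1]
gram = Phi(z[:, None] - z[None, :])
print(np.linalg.eigvalsh(gram).min())           # strictly negative
```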

From this it follows that positive definite stationary kernels on a bounded domain have nonnegative Fourier series coefficients. Bochner’s theorem states a more general result for kernels on unbounded domains: under some regularity conditions, every symmetric, positive definite stationary kernel has a pointwise nonnegative Fourier transform.

7 Positive definiteness of the RBF kernel (254 only \(\star\) \(\star\) \(\star\))

Let \(\Sigma\) denote a \(D \times D\) positive definite matrix. Show that the general RBF kernel

\[ \mathcal{K}_{RBF}(\boldsymbol{x}, \boldsymbol{y}) = \exp\left(- (\boldsymbol{x}- \boldsymbol{y})^\intercal\Sigma (\boldsymbol{x}- \boldsymbol{y}) \right) \]

is positive definite.

Hint: it suffices to produce a series expansion with positive coefficients; you do not need to produce an orthogonal expansion. So you can use the same trick as in the lecture notes using only the expansion \(\exp(z) = \sum_{k=0}^\infty \frac{z^k}{k!}\) after representing everything in terms of the eigenvectors and eigenvalues of \(\Sigma\).
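
If useful, here is a quick numerical sanity check (not the requested proof): with a randomly generated positive definite \(\Sigma\), the Gram matrix of \(\mathcal{K}_{RBF}\) at random points has no eigenvalues below zero, up to floating-point error.

```python
# A numerical sanity check (not the requested proof): the Gram matrix of the
# general RBF kernel with a randomly generated positive definite Sigma has no
# (numerically) negative eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50
B = rng.normal(size=(D, D))
Sigma = B @ B.T + np.eye(D)          # positive definite by construction

X = rng.normal(size=(N, D))
diffs = X[:, None, :] - X[None, :, :]                    # pairwise x - y
quad = np.einsum('ijk,kl,ijl->ij', diffs, Sigma, diffs)  # (x-y)^T Sigma (x-y)
K = np.exp(-quad)

print(np.linalg.eigvalsh(K).min())   # nonnegative up to floating-point error
```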

8 Bibliography

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.