Inner products and valid kernels
Goals
- Introduce some commonly used kernels
- Define a more general notion of inner product and its key properties
- Specify the criterion that a valid kernel function must satisfy (positive definiteness)
- Heuristically state Bochner’s and Mercer’s theorems for feature expansions of PD kernels
Reading
I will be taking readings from Schölkopf and Smola (2002) and Rasmussen and Williams (2006), both of which are available digitally from the UC Berkeley library website. I’ll also use Hastie, Tibshirani, and Friedman (2009) Chapter 12.
Roadmap
In the last lecture, we expressed ridge regression and support vector classification as kernelized procedures by reducing the problem to inner products of the feature vectors, and replacing those inner products with a kernel function. We motivated this for a particular feature expansion, that of unordered monomials.
Two questions arise:
- What other kernel functions can we use? Does any kernel function give a valid learning procedure?
- How can we generalize the kernel trick to other algorithms not obviously expressed in terms of inner products?
The rest of the unit will answer these two questions. The purpose of this lecture is to answer (1), as well as to set up some technical tools that will prepare us to answer (2).
Some common kernels
See Table 4.1 (and in fact all of chapter 4) of Rasmussen and Williams (2006). Some kernels to make sure you’re aware of:
\[ \begin{aligned} \mathcal{K}_{Poly}(\boldsymbol{z}, \boldsymbol{z}') ={}& (C + \boldsymbol{z}^\intercal\boldsymbol{z}')^K & \textrm{Polynomial kernel} \\ \mathcal{K}_{RBF}(\boldsymbol{z}, \boldsymbol{z}') ={}& \exp\left( - \frac{\left\Vert \boldsymbol{z}- \boldsymbol{z}'\right\Vert_2^2}{2\ell} \right) & \textrm{Radial basis function kernel} \\ \mathcal{K}_{Mat}(\boldsymbol{z}, \boldsymbol{z}') ={}& \frac{1}{2^{\nu - 1}\Gamma(\nu)} \left(\frac{\sqrt{2 \nu}}{\ell} \left\Vert\boldsymbol{z}- \boldsymbol{z}'\right\Vert_2 \right)^\nu B_\nu \left(\frac{\sqrt{2\nu}}{\ell} \left\Vert\boldsymbol{z}- \boldsymbol{z}'\right\Vert_2 \right) & \textrm{Matérn kernel} \\ \mathcal{K}_{Exp}(\boldsymbol{z}, \boldsymbol{z}') ={}& \exp\left(-\left\Vert\boldsymbol{z}- \boldsymbol{z}'\right\Vert_2 / \ell \right) & \textrm{Exponential kernel}. \end{aligned} \]
In the Matérn kernel, \(B_\nu\) is the modified Bessel function of the second kind.
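To make these formulas concrete, here is a minimal sketch in Python (not from the readings) that evaluates the polynomial, RBF, and Matérn kernels at a pair of points. The parameter values are arbitrary choices for illustration, and I use scipy's `kv` for the modified Bessel function of the second kind.

```python
import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def k_poly(z, zp, C=1.0, K=2):
    """Polynomial kernel (C + z^T z')^K."""
    return (C + z @ zp) ** K

def k_rbf(z, zp, ell=1.0):
    """RBF kernel exp(-||z - z'||^2 / (2 ell)), matching the parameterization above."""
    return np.exp(-np.sum((z - zp) ** 2) / (2 * ell))

def k_matern(z, zp, ell=1.0, nu=1.5):
    """Matérn kernel; at nu = 1/2 it reduces to the exponential kernel."""
    r = np.linalg.norm(z - zp)
    if r == 0.0:
        return 1.0  # the limit of the expression as r -> 0
    u = np.sqrt(2.0 * nu) * r / ell
    return (2.0 ** (1.0 - nu) / gamma(nu)) * u ** nu * kv(nu, u)

z, zp = np.array([0.5, -1.0]), np.array([0.2, 0.3])
print(k_poly(z, zp), k_rbf(z, zp), k_matern(z, zp))
```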
Another nice aspect of kernels is that you can define kernels for objects that might not have an obvious vector space representation. Consider the “string kernel” (Rasmussen and Williams (2006) 4.4.1) on “sentences” \(s\) which are ordered lists of “words” from a finite vocabulary \(\mathcal{V}\). Let \(|v|\) denote the length of word \(v \in \mathcal{V}\), e.g., the number of letters. We can define a kernel
\[ \mathcal{K}_{String}(s, s') = \sum_{v \in \mathcal{V}} \lambda^{|v|} (\textrm{\# of times $v$ occurs in $s$}) \times (\textrm{\# of times $v$ occurs in $s'$}). \]
We will see below that the string kernel is a valid kernel, and so can be used to run regression or classification algorithms on sentences, e.g. to predict a continuous sentiment value, without any further modification of the kernelized algorithm.
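As a concrete illustration, here is a minimal sketch of the string kernel above; the toy vocabulary, "sentences," and decay parameter \(\lambda\) are hypothetical choices, not from the readings.

```python
def k_string(s, sp, vocab, lam=0.5):
    """String kernel: sum over words v of lam^|v| * (count of v in s) * (count of v in s')."""
    return sum(lam ** len(v) * s.count(v) * sp.count(v) for v in vocab)

# Hypothetical toy vocabulary and "sentences" (ordered lists of words).
vocab = ["the", "cat", "sat", "mat"]
s1 = ["the", "cat", "sat"]
s2 = ["the", "cat", "sat", "on", "the", "mat"]
print(k_string(s1, s2, vocab))  # larger when the sentences share more vocabulary words
```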
Most kernels have tuning parameters, which may be chosen by cross–validation and guided by intuition which we will hopefully develop over the course of the rest of the unit.
Kernels that are functions only of the difference \(\boldsymbol{z}- \boldsymbol{z}'\) are called “stationary kernels,” for reasons that will become clearer later from the perspective of Gaussian processes.
Finally, I will introduce an invalid kernel:
\[ \begin{aligned} \mathcal{K}_{Bad}(\boldsymbol{z}, \boldsymbol{z}') ={}& F(\left\Vert\boldsymbol{z}- \boldsymbol{z}'\right\Vert_2) \textrm{ where }F(0) < F(1) & \textrm{Invalid kernel}. \end{aligned} \]
Below, we will show that there can be no feature representation \(\boldsymbol{x}= \phi(\boldsymbol{z})\) and inner product such that \(\mathcal{K}_{Bad}(\boldsymbol{z}, \boldsymbol{z}') = <\phi(\boldsymbol{z}), \phi(\boldsymbol{z}')>\), in any space.
Inner product spaces
To proceed, we need to move beyond the dot product \(\boldsymbol{x}^\intercal\boldsymbol{x}'\), which is defined in terms of a mechanical operation on finite–dimensional vectors, and define an abstract inner product, as well as an inner product space.
An inner product on a space \(\mathcal{X}\) is a function \(<\cdot, \cdot>: \mathcal{X}\times \mathcal{X}\mapsto \mathbb{R}\) that satisfies the following properties:
\[ \begin{aligned} <x, x'> ={}& <x', x> & \textrm{Symmetry} \\ <a x+ b x'', x'> ={}& a <x, x'> + b <x'', x'> & \textrm{Linearity} \\ <x, x> \ge{}& 0 \textrm{ with }<x, x> = 0 \textrm{ only if }x= 0. & \textrm{Positive definiteness}. \end{aligned} \]
Note that the dot product \(\boldsymbol{x}^\intercal\boldsymbol{x}'\) is an inner product when \(\mathcal{X}= \mathbb{R}^{P}\), but so is \(<\boldsymbol{x}, \boldsymbol{x}'>_\boldsymbol{A}:= \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{x}'\) when \(\boldsymbol{A}\) is any positive definite matrix.
A vector space is a set of objects (e.g. \(\mathbb{R}^{P}\), where the objects are lists of \(P\) real numbers), together with rules for adding those objects and multiplying them by scalars. An inner product space is a vector space associated with an inner product, which we can write as the pair \((\mathcal{X}, <\cdot, \cdot>)\). An inner product space that is complete with respect to the norm induced by \(<\cdot, \cdot>\) is called a “Hilbert space.”
Note that a given vector space could be associated with many different inner product spaces! These inner product spaces contain the same objects, but have different metrics, meaning different notions of distance and size. For example, if \(\boldsymbol{A}\) is a positive definite matrix not equal to the identity, the inner product spaces \((\mathcal{X}, <\cdot, \cdot>_{\boldsymbol{I}_P})\) and \((\mathcal{X}, <\cdot, \cdot>_{\boldsymbol{A}})\) both contain \(P\)–vectors, but in general have different measures of the “size” of \(\boldsymbol{x}\): \(<\boldsymbol{x}, \boldsymbol{x}>_{\boldsymbol{I}_P} = \boldsymbol{x}^\intercal\boldsymbol{x}\) and \(<\boldsymbol{x}, \boldsymbol{x}>_{\boldsymbol{A}} = \boldsymbol{x}^\intercal\boldsymbol{A}\boldsymbol{x}\), respectively.
Similarly, if you have objects which you wish to associate with a well–defined size and similarity, the same measurement of size and similarity could be equally represented by different inner product spaces. As an example, consider the second–order monomials on \(\mathcal{Z}= \mathbb{R}^{2}\), where \(\boldsymbol{z}\in \mathcal{Z}\) is written \(\boldsymbol{z}= (z_1, z_2)^\intercal\). In the last lecture, we studied the ordered second–order monomial feature map \(\phi: \mathbb{R}^{2} \mapsto \mathbb{R}^{4}\) given by \(\phi(\boldsymbol{z}) = (z_1^2, z_1 z_2, z_2 z_1, z_2^2)^\intercal\), which we use to define the inner product
\[ \begin{aligned} \mathcal{K}_{poly}(\boldsymbol{z}, \boldsymbol{s}) ={}& <\phi(\boldsymbol{z}), \phi(\boldsymbol{s})> \\={}& z_1^2 s_1^2 + z_1 z_2 s_1 s_2 + z_2 z_1 s_2 s_1 + z_2^2 s_2^2 \\={}& z_1^2 s_1^2 + 2 z_1 z_2 s_1 s_2 + z_2^2 s_2^2 \\={}& <\tilde{\phi}(\boldsymbol{z}), \tilde{\phi}(\boldsymbol{s})>_{\boldsymbol{A}}, \end{aligned} \]
where \(\tilde{\phi}(\boldsymbol{z}) = (z_1^2, z_1 z_2, z_2^2) \in \mathbb{R}^{3}\) is the unordered monomial map that treats \(z_1 z_2\) the same as \(z_2 z_1\), and the matrix
\[ \boldsymbol{A}= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \\ \end{pmatrix} \]
appropriately accounts for the double counting of the mixed term. Our notion of size and distance could equally well be represented by \((\mathbb{R}^{4}, <\cdot,\cdot>)\) or \((\mathbb{R}^{3}, <\cdot,\cdot>_\boldsymbol{A})\).
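Here is a quick numerical check of this equivalence (a sketch with arbitrary test points), confirming that the ordered-monomial dot product, the \(\boldsymbol{A}\)–weighted unordered-monomial inner product, and the kernel value \((\boldsymbol{z}^\intercal\boldsymbol{s})^2\) all agree.

```python
import numpy as np

def phi_ordered(z):
    # Ordered second-order monomials in R^4.
    return np.array([z[0]**2, z[0]*z[1], z[1]*z[0], z[1]**2])

def phi_unordered(z):
    # Unordered second-order monomials in R^3.
    return np.array([z[0]**2, z[0]*z[1], z[1]**2])

A = np.diag([1.0, 2.0, 1.0])
z, s = np.array([1.0, 2.0]), np.array([-0.5, 3.0])

lhs = phi_ordered(z) @ phi_ordered(s)          # <phi(z), phi(s)>
rhs = phi_unordered(z) @ A @ phi_unordered(s)  # <phi_tilde(z), phi_tilde(s)>_A
kernel = (z @ s) ** 2                          # the second-order polynomial kernel with C = 0
print(lhs, rhs, kernel)                        # all three agree
```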
We can also define inner products on spaces that are not \(\mathbb{R}^{P}\). For example, we could take \(\mathcal{X}= \{\boldsymbol{x}= (x_1, x_2, \ldots): \sum_{p=1}^\infty x_p^2 < \infty\}\) to be square–summable, infinitely long sequences of numbers, and define \(<\boldsymbol{x}, \boldsymbol{x}'> := \sum_{p=1}^\infty x_p x_p'\). This is sometimes called \(\ell_2\).
We can also take spaces of functions. Loosely speaking, we can let \(\mathcal{X}= \{ f(\boldsymbol{x}): \int f(\boldsymbol{x})^2 d\boldsymbol{x}< \infty \}\), and define \(<f, g> = \int f(\boldsymbol{x}) g(\boldsymbol{x}) d\boldsymbol{x}\). Up to measure–theoretic details, this space is sometimes called \(L_2\).
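For instance, here is a sketch of approximating an \(L_2\) inner product numerically; the particular functions are arbitrary square–integrable examples.

```python
import numpy as np
from scipy.integrate import quad

f = lambda x: np.exp(-x**2)          # both functions are square-integrable on the real line
g = lambda x: np.exp(-x**2) * x**2

inner_fg, _ = quad(lambda x: f(x) * g(x), -np.inf, np.inf)  # <f, g> = integral of f * g
norm_f_sq, _ = quad(lambda x: f(x)**2, -np.inf, np.inf)     # <f, f> >= 0, as required
print(inner_fg, norm_f_sq)
```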
Valid kernels
For a given kernel \(\mathcal{K}: \mathcal{Z}\times\mathcal{Z}\mapsto\mathbb{R}\), we would like to establish whether there exists any inner product space \((\mathcal{X}, <\cdot, \cdot>)\) and feature map \(\phi\) such that \(\mathcal{X}\supseteq \{ \boldsymbol{x}: \boldsymbol{x}= \phi(\boldsymbol{z}), \boldsymbol{z}\in \mathcal{Z}\}\) and \(\mathcal{K}(\boldsymbol{z}, \boldsymbol{s}) = <\phi(\boldsymbol{z}), \phi(\boldsymbol{s})>\). Per our discussion above, there may be many such spaces, but if there isn’t at least one, then it does not make much sense to use \(\mathcal{K}\) for the kernel trick.
Obviously it would be enough to identify a particular \(\phi\) and \(<\cdot,\cdot>\) that do the trick. But that may be hard in general, and we also need a way of proving the negative result, namely that no such \(\phi\) exists in any space.
The key is to define the quantity called the gram matrix, which is the kernel evaluated at a set of observations \(\boldsymbol{Z}= (\boldsymbol{z}_1, \ldots, \boldsymbol{z}_N)^\intercal\):
\[ \mathcal{K}_{\boldsymbol{Z}} = \begin{pmatrix} \mathcal{K}(\boldsymbol{z}_1, \boldsymbol{z}_1) & \mathcal{K}(\boldsymbol{z}_1, \boldsymbol{z}_2) & \ldots & \mathcal{K}(\boldsymbol{z}_1, \boldsymbol{z}_N) \\ \mathcal{K}(\boldsymbol{z}_2, \boldsymbol{z}_1) & \mathcal{K}(\boldsymbol{z}_2, \boldsymbol{z}_2) & \ldots & \mathcal{K}(\boldsymbol{z}_2, \boldsymbol{z}_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal{K}(\boldsymbol{z}_N, \boldsymbol{z}_1) & \mathcal{K}(\boldsymbol{z}_N, \boldsymbol{z}_2) & \ldots & \mathcal{K}(\boldsymbol{z}_N, \boldsymbol{z}_N) \\ \end{pmatrix}. \]
If there exists a \(\phi\) and an inner product that fit the bill, then every gram matrix must be positive semi–definite, for the following reason:
\[ \begin{aligned} \boldsymbol{v}^\intercal\mathcal{K}_{\boldsymbol{Z}} \boldsymbol{v} ={}& \sum_{n=1}^N\sum_{m=1}^N v_n \mathcal{K}(\boldsymbol{z}_n, \boldsymbol{z}_m) v_m \\={}& \sum_{n=1}^N\sum_{m=1}^N v_n <\phi(\boldsymbol{z}_n), \phi(\boldsymbol{z}_m)> v_m & \textrm{(kernel as inner product)} \\={}& \left< \sum_{n=1}^N v_n \phi(\boldsymbol{z}_n), \sum_{m=1}^N v_m \phi(\boldsymbol{z}_m) \right> & \textrm{(linearity and symmetry)} \\\ge{}& 0. & \textrm{(positive definiteness of the inner product)} \end{aligned} \]
Note that the preceding argument used nothing about the particular feature map \(\phi\): only the fact that \(\mathcal{K}\) can be written as an inner product of feature vectors in some space. Note also that we are not guaranteed that the final line is strictly greater than zero, since the feature map may discard information from \(\mathcal{Z}\). For example, we might take \(\boldsymbol{z}= (z_1, z_2)\) and \(\phi(\boldsymbol{z}) = (z_1)\), in which case \(\mathcal{K}((0, 1), (0,1)) = 0\).
From this we conclude that if \(\mathcal{K}\) represents an inner product in some space, then every gram matrix is positive semi–definite. This leads to the following definition:
A kernel function \(\mathcal{K}: \mathcal{Z}\times \mathcal{Z}\mapsto \mathbb{R}\) is a “positive definite kernel” if every gram matrix is symmetric and positive semi–definite.
Note the unfortunate mismatch in terminology between “positive definite” kernels and “positive semi–definite” matrices. We call a kernel whose gram matrices are positive definite (not merely semi–definite) a “strictly positive definite kernel.”
We will see in the coming lectures that positive definite kernels always define feature maps in an appropriate inner product space — that is, positive semi–definiteness of the gram matrix is also sufficient. We will show this two different ways: first by choosing an inner product space and finding a feature vector, and second by defining an inner product space in which the feature vectors can be identified trivially.
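In the meantime, a simple numerical sanity check (evidence, not a proof) is to draw some points at random and verify that the gram matrix of a candidate kernel has no appreciably negative eigenvalues. A sketch:

```python
import numpy as np

def gram_matrix(kernel, Z):
    """Evaluate the kernel at all pairs of rows of Z."""
    N = Z.shape[0]
    return np.array([[kernel(Z[n], Z[m]) for m in range(N)] for n in range(N)])

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))

k_rbf = lambda z, s: np.exp(-np.sum((z - s) ** 2) / 2.0)  # RBF kernel with unit length scale
K = gram_matrix(k_rbf, Z)
print(np.min(np.linalg.eigvalsh(K)))  # >= 0 up to floating point error for a valid kernel
```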
Establishing (or refuting) positive definiteness of a kernel
So if we’re given a candidate kernel, how can we discover whether it is positive definite? For a general black–box kernel, this is not necessarily easy. The techniques I am aware of essentially amount to identifying a feature expansion of the form:
\[ \mathcal{K}(\boldsymbol{z}, \boldsymbol{s}) = \sum_{k=1}^\infty \lambda_k \phi_k(\boldsymbol{z}) \phi_k(\boldsymbol{s}) \quad\textrm{for}\quad\lambda_k > 0. \]
Compare this with writing a matrix \(\boldsymbol{A}\) as a sum of rank–one matrices:
\[ \boldsymbol{A}= \sum_{k=1}^K \lambda_k v_k v_k^\intercal. \]
There are three things to note:
- There are many different ways to write \(\boldsymbol{A}\) in this way
- If \(\lambda_k \ge 0\) in such a representation, then \(\boldsymbol{A}\) is positive semi–definite; conversely, every positive semi–definite \(\boldsymbol{A}\) can be written this way with \(\lambda_k \ge 0\) (see the numerical sketch after this list)
- Up to re-indexing and repeated \(\lambda_k\), there is only one way to write this where \(v_k\) are orthogonal, and that is the eigen representation.
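These matrix facts are easy to verify numerically. Here is a brief sketch using the eigendecomposition of an arbitrary positive semi–definite matrix built for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T  # positive semi-definite by construction

lam, V = np.linalg.eigh(A)    # eigenvalues and orthonormal eigenvectors
print(np.all(lam >= -1e-12))  # PSD <=> all eigenvalues are nonnegative

# Reconstruct A as a sum of rank-one matrices lam_k v_k v_k^T.
A_rebuilt = sum(lam[k] * np.outer(V[:, k], V[:, k]) for k in range(4))
print(np.allclose(A, A_rebuilt))
```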
All these facts, familiar from matrices, extend to the case of kernel functions. In fact, though we will not prove it here, Mercer’s theorem states that any positive definite kernel admits a feature expansion with orthogonal features, where orthogonality is with respect to a function space inner product.
Note that an expansion of the above form is little more than identifying the (possibly infinite–dimensional) feature vector whose \(k\)–th entry is \(\sqrt{\lambda_k} \phi_k(\boldsymbol{z})\). But it does point to the possibility of using standard series expansion techniques. Below, we will use Fourier series to find series expansions for stationary kernels. For example, taking \(z\in \mathbb{R}\), we can find a series expansion for \(\mathcal{K}_{RBF}(z, s) = \exp\left(-\frac{1}{2}(z- s)^2\right)\) using a Taylor series:
\[ \begin{aligned} \exp\left(-\frac{1}{2}(z- s)^2\right) ={}& \exp\left(-\frac{1}{2}z^2 \right) \exp\left(-\frac{1}{2}s^2 \right)\exp\left(zs\right) \\={}& \exp\left(-\frac{1}{2}z^2 \right) \exp\left(-\frac{1}{2}s^2 \right) \sum_{k=0}^\infty \frac{1}{k!} (zs)^k \\={}& \sum_{k=0}^\infty \frac{1}{k!} \left(\exp\left(-\frac{1}{2}z^2 \right) z^k \right) \left(\exp\left(-\frac{1}{2}s^2 \right) s^k \right). \end{aligned} \]
From this it follows that the kernel \(\mathcal{K}_{RBF}(z, s)\) can be featurized with \(\phi_k(z) = \exp\left(-\frac{1}{2}z^2 \right) z^k\) and weights \(\lambda_k = 1/k!\), for \(k = 0, 1, 2, \ldots\).
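Here is a sketch of checking this expansion numerically by truncating the series at a modest order; the truncation level and test points are arbitrary choices.

```python
import numpy as np
from math import factorial

def phi_taylor(z, K=15):
    """Truncated feature map for the 1-d RBF kernel with unit length scale.

    The weights sqrt(lambda_k) = 1 / sqrt(k!) are folded into the features."""
    return np.array([np.exp(-0.5 * z**2) * z**k / np.sqrt(factorial(k)) for k in range(K)])

z, s = 0.7, -0.3
k_exact = np.exp(-0.5 * (z - s) ** 2)
k_approx = phi_taylor(z) @ phi_taylor(s)
print(k_exact, k_approx)  # very close for moderate |z|, |s|
```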
An example of an invalid kernel
It is often easier to prove that a kernel is not positive definite, since it suffices to find a single set of points for which the gram matrix is not positive semi–definite. This can lead to some necessary conditions on kernels, as we now see.
First, we show that \(\mathcal{K}_{Bad}\) is an invalid kernel by our definition. Recall that
\[ \begin{aligned} \mathcal{K}_{Bad}(\boldsymbol{z}, \boldsymbol{z}') ={}& F(\left\Vert\boldsymbol{z}- \boldsymbol{z}'\right\Vert_2) \textrm{ where }F(0) < F(1) & \textrm{Invalid kernel}. \end{aligned} \]
Take \(\boldsymbol{z}_A\) and \(\boldsymbol{z}_B\) such that \(\left\Vert\boldsymbol{z}_A - \boldsymbol{z}_B\right\Vert_2 = 1\). Then the gram matrix for these two points is
\[ \mathcal{K}_{Bad,\boldsymbol{Z}} = \begin{pmatrix} \mathcal{K}_{Bad}(\boldsymbol{z}_A, \boldsymbol{z}_A) & \mathcal{K}_{Bad}(\boldsymbol{z}_A, \boldsymbol{z}_B) \\ \mathcal{K}_{Bad}(\boldsymbol{z}_B, \boldsymbol{z}_A) & \mathcal{K}_{Bad}(\boldsymbol{z}_B, \boldsymbol{z}_B) \\ \end{pmatrix} ={} \begin{pmatrix} F(0) & F(1) \\ F(1) & F(0) \\ \end{pmatrix}. \]
This gram matrix is not positive semi–definite: its determinant is \(F(0)^2 - F(1)^2 < 0\), and since the determinant is the product of the eigenvalues, exactly one of the two eigenvalues is negative.
We can conclude from this that a valid stationary kernel always achieves its maximum at zero: applying the same argument to points at any distance \(d\) shows that we must have \(|F(d)| \le F(0)\).
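For a concrete instance (a sketch): take \(F\) to be the distance itself, so that \(F(0) = 0 < F(1) = 1\), and inspect the two–point gram matrix.

```python
import numpy as np

# "Kernel" equal to the Euclidean distance: F(d) = d, so F(0) = 0 < F(1) = 1.
k_bad = lambda z, s: np.linalg.norm(z - s)

zA, zB = np.array([0.0]), np.array([1.0])  # ||zA - zB|| = 1
K = np.array([[k_bad(zA, zA), k_bad(zA, zB)],
              [k_bad(zB, zA), k_bad(zB, zB)]])
print(K)                      # [[0, 1], [1, 0]]
print(np.linalg.eigvalsh(K))  # eigenvalues -1 and +1: not positive semi-definite
```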
Stationary kernels and Fourier series
For stationary kernels on a bounded domain, the Fourier series provides a natural basis expansion.
To begin, one might wonder how you could possibly have feature vectors for a stationary kernel: a stationary kernel is a function only of the difference, \(\mathcal{K}(\boldsymbol{z}, \boldsymbol{s}) = \kappa(\boldsymbol{z}- \boldsymbol{s})\) for some function \(\kappa\), but the individual feature products \(\phi_k(\boldsymbol{z}) \phi_k(\boldsymbol{s})\) are not functions of the difference alone.
To resolve this apparent contradiction, take \(z\in \mathbb{R}\) and consider the feature vector \[ \begin{aligned} \phi_{\omega}(z) = \begin{pmatrix} \cos(\omega z) \\ \sin(\omega z) \\ \end{pmatrix} \end{aligned}. \]
Then
\[ \begin{aligned} \phi_{\omega}(z)^\intercal\phi_{\omega}(s) ={}& \begin{pmatrix} \cos(\omega z) & \sin(\omega z) \end{pmatrix} \begin{pmatrix} \cos(\omega s) \\ \sin(\omega s) \\ \end{pmatrix} \\={}& \cos(\omega z) \cos(\omega s) + \sin(\omega z) \sin(\omega s) \\={}& \cos(\omega (z- s)) \end{aligned} \]
where the final line follows from a standard trig identity. It follows that if we were to write a regression function as
\[ f(z; \beta) = \sum_{\omega} \left( \beta_{\omega,1} \cos(\omega z) + \beta_{\omega,2} \sin(\omega z) \right) \]
for some coefficients \(\beta_{\omega,1}, \beta_{\omega,2}\), then the implicit kernel corresponding to this inner product would be a stationary kernel. (Note that it is important to include both the sine and the cosine feature at each frequency; it is the pairing \(\phi_\omega(z) = (\cos(\omega z), \sin(\omega z))^\intercal\) that makes the implied kernel a function of \(z - s\) alone. If you prefer, you can package each pair into a single complex–valued feature \(e^{i \omega z}\) via Euler’s identity and use the standard complex inner product \(<\boldsymbol{z}, \boldsymbol{z}'> = \boldsymbol{z}^\intercal\overline{\boldsymbol{z}'}\).)
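Here is a quick numerical check of the identity above, summing the cosine–sine feature products over a few arbitrary frequencies.

```python
import numpy as np

omegas = np.array([1.0, 2.5, 4.0])  # arbitrary frequencies

def phi(z):
    # Stack the (cos, sin) pair for each frequency.
    return np.concatenate([np.cos(omegas * z), np.sin(omegas * z)])

z, s = 0.8, -1.3
lhs = phi(z) @ phi(s)
rhs = np.sum(np.cos(omegas * (z - s)))
print(lhs, rhs)  # equal: the implied kernel depends only on z - s
```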
This strongly motivates looking for Fourier features as the feature expansions for stationary kernels. In fact, if \(\mathcal{K}(z, s) = \kappa(z- s)\) is stationary and defined on a bounded domain, then we can write \(\Delta = z- s\) and, after rescaling, assume \(\Delta \in [-1,1]\) without loss of generality; symmetry of the kernel gives \(\kappa(-\Delta) = \kappa(\Delta)\). Then we can represent \(\kappa(\Delta)\) with a Fourier series:
\[ \kappa(\Delta) = \sum_{k=0}^\infty \lambda_k \cos(2\pi k \Delta), \]
where we note that the sine terms are zero because \(\kappa\) is symmetric. Plugging in,
\[ \begin{aligned} \cos(2\pi k \Delta) ={}& \cos(2\pi k (z- s)) \\={}& \cos(2\pi k z) \cos(2\pi k s) + \sin(2\pi k z) \sin(2\pi k s) \Rightarrow \\ \mathcal{K}(z, s) ={}& \sum_{k=0}^\infty \lambda_k \cos(2\pi k z) \cos(2\pi k s) + \sum_{k=0}^\infty \lambda_k \sin(2\pi k z) \sin(2\pi k s). \end{aligned} \]
We thus see that a stationary kernel on a bounded domain is positive definite if and only if the Fourier series of \(\Delta \mapsto \kappa(\Delta)\) has nonnegative coefficients. The generalization of this to stationary kernels on all of \(\mathbb{R}^P\) is called Bochner’s theorem. A precise statement requires Fourier transforms and some measure–theoretic details, and so is beyond the scope of this course. But the point is that stationary kernels can be expressed using Fourier features.
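As a numerical illustration of the “if” direction (a sketch with arbitrary coefficients): build a stationary kernel from a handful of nonnegative cosine coefficients and check that its gram matrix on random points is positive semi–definite.

```python
import numpy as np

lams = np.array([0.5, 0.3, 0.15, 0.05])  # nonnegative Fourier coefficients, k = 0, ..., 3
kappa = lambda d: sum(lams[k] * np.cos(2 * np.pi * k * d) for k in range(4))

rng = np.random.default_rng(2)
z = rng.uniform(0, 1, size=30)        # points in a unit-length domain
K = kappa(z[:, None] - z[None, :])    # gram matrix with entries kappa(z_n - z_m)
print(np.min(np.linalg.eigvalsh(K)))  # nonnegative up to floating point error
```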
Interestingly, we also see that, if \(k \ne k'\), then, integrating over one period (say, the unit interval),
\[ \int_0^1 \cos(2\pi k z) \cos(2\pi k' z) dz= \int_0^1 \cos(2\pi k z) \sin(2\pi k' z) dz= 0, \]
so we have found not only features, but orthogonal features. In fact, for bounded domains, the Fourier features are precisely the features guaranteed to exist by Mercer’s theorem.