Support Vector Machines

Author

Erez Buchweitz

From perceptron to SVM

The perceptron algorithm finds a hyperplane that separates the training data, meaning \[ y_n = \text{sign}(\boldsymbol{x}_n^T\hat{\boldsymbol{\beta}}) \; \; \; \; \text{for all}\; \; n=1,\dots,N, \] by running stochastic gradient descent to minimize hinge loss. A crucial assumption is that the data are indeed linearly separable.
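For concreteness, here is a minimal perceptron sketch in Python; the function name, stopping rule, and step size are illustrative assumptions rather than a reference implementation. It cycles over the data and takes a gradient-style update on every misclassified point, assuming labels in \(\{-1,+1\}\) and linearly separable data.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Minimal perceptron sketch: cycle over the data, update on mistakes.
    Assumes y takes values in {-1, +1} and the data are linearly separable."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        mistakes = 0
        for n in range(N):
            if y[n] * (X[n] @ beta) <= 0:   # misclassified (or on the boundary)
                beta += y[n] * X[n]          # stochastic (sub)gradient-style update
                mistakes += 1
        if mistakes == 0:                    # all points separated: done
            break
    return beta
```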

SVMs add three layers on top of this:

  • Regularization: find the separating hyperplane with maximum margin.
  • Soft margin: allow for margin mistakes (e.g. not linearly separable).
  • Kernel: expressive (possibly infinite) feature mapping.

All three together amount to what is referred to as SVM.

Regularization

For finite data that are linearly separable, there are always infinitely many separating hyperplanes. Which one does the perceptron converge to? Whatever choice it makes is implicit, so let's make it explicit and ask for a particular one. As a form of regularization, we’ll ask for the one with maximum margin: \[ \hat{\boldsymbol{\beta}}_{\text{hard}} = \underset{\|\boldsymbol{\beta}\|_2=1,\gamma}{\mathrm{argmax}}\, \; \; \; \; \gamma \; \; \; \; \text{subject to} \; \; \; \; y_n\boldsymbol{x}_n^T\boldsymbol{\beta}\geq \gamma \; \; \forall n=1,\dots,N. \] This is known as a hard margin (linear) SVM. To be clear, though the maximization is over two parameters, we are only interested in the normal \(\hat{\boldsymbol{\beta}}_{\text{hard}}\). The requirement that \(\hat{\boldsymbol{\beta}}_{\text{hard}}\) be a unit vector is necessary, because otherwise \(\boldsymbol{\beta}\) could always be rescaled to increase \(\gamma\) artificially.

The next proposition shows that maximizing the margin is equivalent to \(\ell_2\)-regularizing the normal.

Proposition: \[ \hat{\boldsymbol{\beta}}_{\text{hard}} = \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \; \; \; \; \|\boldsymbol{\beta}\|_2 \; \; \; \text{subject to} \; \; \; \; y_n\boldsymbol{x}_n^T\boldsymbol{\beta}\geq 1 \; \; \forall n=1,\dots,N. \]

Proof:

Make the change of variable \(\tilde{\boldsymbol{\beta}}=\frac{1}{\gamma}\boldsymbol{\beta}\): \[ \underset{\|\boldsymbol{\beta}\|_2=1,\gamma}{\mathrm{argmax}}\, \left\{\gamma: y_n\boldsymbol{x}_n^T\boldsymbol{\beta}\geq \gamma \; \; \forall n=1,\dots,N\right\} = \underset{\|\tilde{\boldsymbol{\beta}}\|_2=1/\gamma,\gamma}{\mathrm{argmax}}\, \left\{\frac{1}{\|\tilde{\boldsymbol{\beta}}\|_2}: y_n\boldsymbol{x}_n^T\tilde{\boldsymbol{\beta}} \geq 1 \; \; \forall n=1,\dots,N\right\}. \] Now simply change the maximization to minimization, and drop the redundant variable \(\gamma\): \[ = \underset{\tilde{\boldsymbol{\beta}}}{\mathrm{argmin}}\, \left\{\|\tilde{\boldsymbol{\beta}}\|_2: y_n\boldsymbol{x}_n^T\tilde{\boldsymbol{\beta}} \geq 1 \; \; \forall n=1,\dots,N\right\}. \] (Strictly, the two problems agree up to rescaling: \(\hat{\boldsymbol{\beta}}_{\text{hard}}\) is the unit-norm version of the minimizer \(\tilde{\boldsymbol{\beta}}\), and both define the same hyperplane.) The proof is concluded.

Notes:

  • An intercept may be added, and, as usual, it is not penalized.
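As a sketch of how the equivalent min-norm problem might be solved in practice, here is a small example using the cvxpy convex-optimization library; the choice of cvxpy and all names below are assumptions for illustration. It includes an unpenalized intercept, as in the note above, and assumes the data are linearly separable.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Sketch: minimize ||beta||_2 subject to y_n (x_n^T beta + b) >= 1.
    Assumes X is (N, p), y is in {-1, +1}, and the data are linearly separable.
    The intercept b is not penalized."""
    N, p = X.shape
    beta = cp.Variable(p)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ beta + b) >= 1]
    problem = cp.Problem(cp.Minimize(cp.norm(beta, 2)), constraints)
    problem.solve()
    return beta.value, b.value
```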

Soft margin

The hard margin SVM is restricted to linearly separable data. We now introduce a soft margin (linear) SVM, which trades off separability against \(\|\boldsymbol{\beta}\|_2\): \[ \hat{\boldsymbol{\beta}}_{\text{soft}}(\lambda) = \underset{\boldsymbol{\beta}}{\mathrm{argmin}}\, \sum_{n=1}^N \max(1-y_n\boldsymbol{x}_n^T\boldsymbol{\beta}, 0) + \lambda\|\boldsymbol{\beta}\|_2^2. \] This way, we may allow classification (or margin) mistakes for the purpose of reducing \(\|\boldsymbol{\beta}\|_2^2\). This brings us to the familiar form of minimizing loss plus penalty.
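One straightforward (if not the most efficient) way to minimize this objective is plain subgradient descent; the sketch below assumes arbitrary illustrative choices of learning rate, iteration count, and names.

```python
import numpy as np

def soft_margin_svm_subgradient(X, y, lam=1.0, lr=1e-3, epochs=200):
    """Subgradient descent on sum_n max(1 - y_n x_n^T beta, 0) + lam * ||beta||_2^2."""
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        margins = y * (X @ beta)
        # Subgradient of the hinge term is -y_n x_n for points with margin < 1, else 0.
        active = margins < 1
        grad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * beta
        beta -= lr * grad
    return beta
```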

Connection to hard margin SVM:

To see how the soft margin SVM relates to the hard margin one, add slack variables \(\xi_n\geq \max(1-y_n\boldsymbol{x}_n^T\boldsymbol{\beta}, 0)\) and observe that \[ \hat{\boldsymbol{\beta}}_{\text{soft}}(\lambda) = \underset{\boldsymbol{\beta},\boldsymbol{\xi}}{\mathrm{argmin}}\, \|\boldsymbol{\beta}\|_2^2 + \frac{1}{\lambda}\sum_{n=1}^N \xi_n \; \; \; \; \text{subject to} \; \; \; \; y_n\boldsymbol{x}_n^T\boldsymbol{\beta}\geq 1-\xi_n, \; \; \xi_n\geq 0 \; \; \forall n=1,\dots,N. \] (We have also divided the objective by \(\lambda\).) From this formulation we may intuitively believe that, for linearly separable data, \(\hat{\boldsymbol{\beta}}_{\text{hard}}\) is obtained as the limit of \(\hat{\boldsymbol{\beta}}_{\text{soft}}(\lambda)\) as \(\lambda\to 0^+\), since the slack variables \(\xi_1,\dots,\xi_N\) are pushed to zero.
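For illustration, this constrained form with explicit slack variables can be handed to a generic convex/QP solver; the following is a hedged sketch using cvxpy (library choice and names are assumptions, mirroring the hard margin sketch above).

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm_qp(X, y, lam=1.0):
    """Sketch: minimize ||beta||_2^2 + (1/lam) * sum_n xi_n
    subject to y_n x_n^T beta >= 1 - xi_n and xi_n >= 0."""
    N, p = X.shape
    beta = cp.Variable(p)
    xi = cp.Variable(N)
    constraints = [cp.multiply(y, X @ beta) >= 1 - xi, xi >= 0]
    objective = cp.Minimize(cp.sum_squares(beta) + (1 / lam) * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return beta.value, xi.value
```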

Interpretation as penalty for margin mistakes:

The slack variable \(\xi_n\) is nonzero if we make a margin mistake on \((\boldsymbol{x}_n,y_n)\). It is greater than one if we make a classification mistake. The sum \(\sum_{n=1}^N\xi_n\) quantifies the total over all margin mistakes, which we can control by adjusting the penalty \(\lambda\).

Solving the optimization problem

Using convex optimization theory (we won’t show this), one may show that the soft margin SVM solution satisfies \[ \hat{\boldsymbol{\beta}}_{\text{soft}} = \sum_{n=1}^N \alpha_n y_n\boldsymbol{x}_n \; \; \; \; \text{where} \; \; \; \; \alpha_n\ne 0 \; \; \text{if and only if} \; \; y_n\boldsymbol{x}_n^T\hat{\boldsymbol{\beta}}_{\text{soft}} \leq 1. \] This sum is usually sparse, meaning that the number of participating observations is much smaller than \(N\). The observations that lie on or within the margin are referred to as “support vectors”.
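To see the sparsity and the support vectors concretely, here is a small illustration with scikit-learn’s SVC (note that scikit-learn parameterizes the trade-off with C rather than \(\lambda\), roughly corresponding to \(1/\lambda\) up to a constant factor; the toy data and names below are illustrative assumptions).

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Typically only a few observations have alpha_n != 0.
print("support vectors:", len(clf.support_), "out of", len(X))

# dual_coef_ holds alpha_n * y_n for the support vectors, so the expansion
# sum_n alpha_n y_n x_n recovers the fitted normal vector.
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_dual, clf.coef_))
```

The check at the end confirms that the fitted normal can be reconstructed from the dual coefficients and the support vectors alone.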

Kernels

If we used an expressive feature transformation \(\varphi(\boldsymbol{x}_n)\in\mathbb{R}^{Q}\), the soft margin (non-linear!) SVM \(\hat{\boldsymbol{\beta}}_{\text{kernel}}\in\mathbb{R}^Q\) would satisfy \[ \hat{\boldsymbol{\beta}}_{\text{kernel}} = \sum_{n=1}^N \alpha_n y_n\varphi(\boldsymbol{x}_n). \] The prediction on a new observation would be \[ y= \text{sign}(\varphi(\boldsymbol{x})^T\hat{\boldsymbol{\beta}}_{\text{kernel}}) = \text{sign}\left(\sum_{n=1}^N \alpha_ny_n \varphi(\boldsymbol{x})^T\varphi(\boldsymbol{x}_n)\right). \]

Infinite transformation:

What if \(Q=\infty\)? There is no hope of computing the transformed feature vector \(\varphi(\boldsymbol{x})\), but a closer look reveals that only the inner product \[ \varphi(\boldsymbol{x})^T \varphi(\boldsymbol{x}_n) \] is required to compute the prediction. (The same is true for computing \(\alpha_n\).)

Reproducing kernel Hilbert spaces (RKHSs):

There is a rich theory on infinite-dimensional spaces for which inner products are easy to compute, and the soft margin SVM adapts well to RKHSs. (The same is true for linear regression!) Without getting into too much detail at this point, RKHSs use a kernel \(k(\cdot,\cdot)\) such that \[ \varphi(\boldsymbol{x})^T\varphi(\boldsymbol{z}) = k(\boldsymbol{x},\boldsymbol{z}). \] Examples of kernels:

  • Polynomial: \(k(\boldsymbol{x},\boldsymbol{z})=(\boldsymbol{x}^T\boldsymbol{z}+1)^d\). (This is finite dimensional.)
  • Gaussian: \(k(\boldsymbol{x},\boldsymbol{z})=e^{-\|\boldsymbol{x}-\boldsymbol{z}\|_2^2}\).
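Putting the pieces together, here is a sketch of kernelized prediction that never forms \(\varphi(\boldsymbol{x})\): it fits a Gaussian-kernel SVM with scikit-learn and then recomputes the predictions by hand using only kernel evaluations against the support vectors. The toy data, the choice gamma=1.0 (to match the Gaussian kernel above), and the intercept added by scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X, Z):
    """k(x, z) = exp(-||x - z||_2^2), computed for all pairs of rows."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)  # not linearly separable in R^2

# gamma=1.0 makes scikit-learn's RBF kernel match gaussian_kernel above.
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# Predict on new points using only kernel evaluations against the support vectors:
# f(x) = sum_n alpha_n y_n k(x, x_n) + b, and the prediction is sign(f(x)).
X_new = rng.normal(size=(5, 2))
K = gaussian_kernel(clf.support_vectors_, X_new)   # (n_SV, 5)
f = clf.dual_coef_ @ K + clf.intercept_            # dual_coef_ holds alpha_n * y_n
print(np.allclose(np.sign(f).ravel(), clf.predict(X_new)))
```

Only the support vectors enter the sum, so the prediction cost scales with their number rather than with the (possibly infinite) dimension \(Q\).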