The Perceptron Algorithm
This lecture follows Hardt and Recht (2021), Chapter 3, with modifications.
Suppose we have binary classification data \((\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_N,y_N)\), where \(y_n\in\{-1,1\}\), and we want to find a linear decision boundary: \[ \hat{y}_n=\text{sign}(\boldsymbol{x}_n^T\hat{\boldsymbol{\beta}}). \] Such a vector \(\hat{\boldsymbol{\beta}}\) is normal to a hyperplane that separates the points we predict as \(+1\) from those we predict as \(-1\).
Examples: logistic regression and linear discriminant analysis (LDA). Both make assumptions on the data-generating process, and then perform optimal classification under those assumptions. Today we will refrain from making strong assumptions, and describe an algorithm that is guaranteed to find a separating hyperplane on any linearly separable dataset. That is, if there exists a vector \(\boldsymbol{\beta}^*\) that separates the true class labels, \[ y_n = \text{sign}(\boldsymbol{x}_n^T\boldsymbol{\beta}^*) \; \; \; \; \text{for all } \; \; n=1,\dots,N, \] then the algorithm will find such a hyperplane.
The perceptron algorithm, which achieves this, was invented in the 1950s. It is not actually used in practice today, but we cover it because it:
- Is a precursor to support vector machines (SVM) and to deep learning.
- Elegantly showcases modern themes of machine learning.
The perceptron algorithm
- Set \(\boldsymbol{\beta}=0\).
- Loop indefinitely:
  - Draw \(n\in\{1,\dots,N\}\) at random.
  - If \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}< 1\), set \(\boldsymbol{\beta}\mathrel{+}= y_n\boldsymbol{x}_n\).
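To make the loop concrete, here is a minimal NumPy sketch of the algorithm above. The function name `perceptron`, the fixed iteration budget (the pseudocode loops indefinitely), and the toy data at the bottom are our own illustration choices, not part of the algorithm as stated.

```python
import numpy as np

def perceptron(X, y, n_iter=100_000, seed=0):
    """Minimal sketch of the perceptron loop above.

    X : (N, P) array of features (rows assumed to have unit norm).
    y : (N,) array of labels in {-1, +1}.
    n_iter : iteration budget; we stop after a fixed number of random draws
             instead of looping indefinitely.
    """
    rng = np.random.default_rng(seed)
    N, P = X.shape
    beta = np.zeros(P)                      # set beta = 0
    for _ in range(n_iter):
        n = rng.integers(N)                 # draw n in {1, ..., N} at random
        if y[n] * X[n] @ beta < 1:          # margin mistake?
            beta += y[n] * X[n]             # beta += y_n x_n
    return beta

# Toy usage: two well-separated clusters, normalized onto the unit circle.
rng = np.random.default_rng(1)
X = np.r_[rng.normal(+2, 0.5, (50, 2)), rng.normal(-2, 0.5, (50, 2))]
X /= np.linalg.norm(X, axis=1, keepdims=True)           # standardize ||x_n|| = 1
y = np.r_[np.ones(50), -np.ones(50)]
beta_hat = perceptron(X, y)
print("training accuracy:", np.mean(np.sign(X @ beta_hat) == y))
```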
Interpretation
The condition:
We call an observation that meets the condition \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}< 1\) a margin mistake. This happens if either:
- It is incorrectly classified: \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}\leq 0\). To see this, note that \[ y_n = \text{sign}(\boldsymbol{x}_n^T\boldsymbol{\beta}) \; \; \; \; \text{if and only if} \; \; \; \; y_n\boldsymbol{x}_n^T\boldsymbol{\beta}> 0. \]
- It is correctly classified but with too small a margin: \(0<y_n\boldsymbol{x}_n^T\boldsymbol{\beta}<1\).
A margin mistake that is not a classification mistake can be “corrected” simply by inflating the scale of \(\boldsymbol{\beta}\). So, without controlling the norm of \(\boldsymbol{\beta}\), the margin doesn’t really say a lot. Controlling the norm of \(\boldsymbol{\beta}\) will be a feature of SVM, but we’re not there yet.
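A tiny numeric illustration of this point, with made-up numbers: inflating the scale of \(\boldsymbol{\beta}\) removes a margin mistake on a correctly classified point, while leaving the prediction itself unchanged.

```python
import numpy as np

x, y = np.array([0.6, 0.8]), 1     # a unit-norm point with label +1
beta = np.array([0.3, 0.4])        # correct sign, but y x^T beta = 0.5 < 1

print(y * x @ beta)                # 0.5 -> a margin mistake, not a classification mistake
print(y * x @ (10 * beta))         # 5.0 -> inflating beta "corrects" the margin mistake
print(np.sign(x @ beta) == np.sign(x @ (10 * beta)))   # True: the prediction is unchanged
```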
The update rule:
Makes \(\boldsymbol{\beta}\) more aligned with \(y_n\boldsymbol{x}_n\), the observation on which it made a margin mistake. Note that the “best” possible normal for the single observation \((\boldsymbol{x}_n,y_n)\) is, up to scale, \(y_n\boldsymbol{x}_n\) itself, where “best” means with the largest margin. Finding the hyperplane that separates the data with the largest possible margin will be a feature of SVM, but we’re not there yet.
The output:
At this point, we’re not even sure whether the algorithm converges, let alone whether its limit is any good.
Convergence
We make two assumptions about the observations:
- Standardized: \(\|\boldsymbol{x}_n\|_2=1\) (not actually needed, see Hardt and Recht (2021); it just simplifies notation).
- Linearly separable: there is a unit vector \(\boldsymbol{\beta}^*\) which separates the data with margin \(\gamma\): \[ y_n\boldsymbol{x}_n^T\boldsymbol{\beta}^*>\gamma>0 \; \; \; \; \text{for all } \; \; n=1,\dots,N. \] Now that the norm \(\|\boldsymbol{\beta}^*\|_2=1\) is controlled, the margin \(\gamma\) is meaningful. Any finite dataset that is linearly separable is linearly separable with a positive margin, so the existence of a positive margin is no news. Its value, however, in particular the largest margin \(\gamma\) that can be achieved on the data, is used to bound the number of margin mistakes the algorithm makes during training before it converges (a short numeric check of these assumptions follows below).
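As a concrete check of these two assumptions, the short helper below (our own illustration, not from Hardt and Recht) normalizes the rows, normalizes a candidate \(\boldsymbol{\beta}^*\), and reports the smallest value of \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}^*\); the data are linearly separable with margin \(\gamma\) by this \(\boldsymbol{\beta}^*\) exactly when that minimum exceeds \(\gamma>0\).

```python
import numpy as np

def margin(X, y, beta_star):
    """Smallest value of y_n x_n^T beta* over the data, after enforcing
    unit-norm rows and a unit-norm beta* (the two assumptions above)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # standardize ||x_n|| = 1
    beta_star = beta_star / np.linalg.norm(beta_star)   # control ||beta*|| = 1
    return np.min(y * (X @ beta_star))

# Toy data: separable by beta* = (1, 0), so the reported margin is positive.
X = np.array([[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, -0.2]])
y = np.array([1, 1, -1, -1])
print(margin(X, y, np.array([1.0, 0.0])))   # approximately 0.96
```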
Lemma: The algorithm makes at most \(\frac{3}{\gamma^2}\) margin mistakes during training.
Proof of lemma: Skipped for now.
So:
- The algorithm converges after finitely many updates to a vector that makes no margin mistakes. Call this limit \(\hat{\boldsymbol{\beta}}\), which we take to be the output of the algorithm.
- The number of margin mistakes made during training depends on the margin, but not on \(N,P\) (if \(\boldsymbol{x}_n\in\mathbb{R}^P\)).
- We did not need any modeling assumptions, only that the data are linearly separable.
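The lemma can also be checked empirically. The simulation below is our own construction: points are drawn on the unit circle so that \(\boldsymbol{\beta}^*=(1,0)\) separates them with margin at least \(\gamma\), and we count how many updates (margin mistakes) the perceptron makes. The count should stay below \(3/\gamma^2\) and should not grow with \(N\).

```python
import numpy as np

def count_margin_mistakes(N, gamma, n_iter=200_000, seed=0):
    """Run the perceptron on unit-norm data separable with margin >= gamma
    by beta* = (1, 0), and count the number of margin mistakes (updates)."""
    rng = np.random.default_rng(seed)
    theta_max = np.arccos(gamma)
    theta = rng.uniform(-theta_max, theta_max, N)   # cos(theta) >= gamma
    side = rng.choice([-1.0, 1.0], N)               # reflect half the points to the other side
    X = np.c_[side * np.cos(theta), np.sin(theta)]  # unit-norm rows
    y = side                                        # so y_n x_n^T beta* = cos(theta_n) >= gamma
    beta, mistakes = np.zeros(2), 0
    for _ in range(n_iter):
        n = rng.integers(N)
        if y[n] * X[n] @ beta < 1:
            beta += y[n] * X[n]
            mistakes += 1
    return mistakes

gamma = 0.3
for N in (100, 1_000, 10_000):
    print(f"N={N:>6}: {count_margin_mistakes(N, gamma)} mistakes, bound 3/gamma^2 = {3 / gamma**2:.1f}")
```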
Question: If the data are not linearly separable, will the algorithm converge, and to what?
Answer: For you to think about.
Generalization
We now understand that the algorithm converges to a separating hyperplane, provided that one exists. The question now becomes whether the separating hyperplane it converges to is any good, in the sense of generalizing well to new data. We’ll see that any separating hyperplane is good, using a uniform LLN bound. Then we’ll see that the particular one the algorithm converges to is especially good.
Consider a test observation \((\boldsymbol{x},y)\); we want to bound the expected test error, \[ \underbrace{\mathbb{P}_{\,}\left(y\ne\text{sign}(\boldsymbol{x}^T\hat{\boldsymbol{\beta}})\Big|\hat{\boldsymbol{\beta}}\right)}_{=\mathscr{L}(\hat{\boldsymbol{\beta}})=\mathbb{E}_{\,}\left[I(y\ne\text{sign}(\boldsymbol{x}^T\hat{\boldsymbol{\beta}}))|\hat{\boldsymbol{\beta}}\right]}. \] We have to make an assumption about how the data are generated; otherwise there is no hope that the training data will tell us anything about test data.
Assumption:
- \((\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_N,y_N),(\boldsymbol{x},y)\) IID.
- Standardized: \(\|\boldsymbol{x}\|_2=1\) with probability one.
- Linear separability, with a margin: there is a unit vector \(\boldsymbol{\beta}^*\) for which, with probability one, \[ y\boldsymbol{x}^T\boldsymbol{\beta}^* > \gamma > 0. \] This means that the same margin that applies to training data will apply to test data. This is slightly harsh and can be relaxed, see Hardt and Recht (2021). The latter two assumptions are not required for the ULLN bound.
Via ULLN
Claim: VC dimension of linear separators is \(P+1\), if \(\boldsymbol{x}_n\in\mathbb{R}^P\).
Proof: In homework.
Using the lecture on VC dimension, we are guaranteed with probability at least \(1-e^{-N\delta^2/2}\) that \[ \bigg|\underbrace{\frac{1}{N}\sum_{n=1}^N I(y_n\ne\text{sign}(\boldsymbol{x}_n^T\hat{\boldsymbol{\beta}}))}_{=\hat{\mathscr{L}}(\hat{\boldsymbol{\beta}})=0 \text{ by lemma}} - \underbrace{\mathbb{P}_{\,}\left(y\ne\text{sign}(\boldsymbol{x}^T\hat{\boldsymbol{\beta}})|\hat{\boldsymbol{\beta}}\right)}_{=\mathscr{L}(\hat{\boldsymbol{\beta}})}\bigg| \leq \sqrt{\frac{(P+1)\log(N+1)}{N}} + \delta \] Meaning, with high probability we obtained a classifier \(\hat{\boldsymbol{\beta}}\) whose expected test error is very small for large \(N\). This comes from a worst-case bound for any linear separator, so we don’t need to know anything about the particular separating hyperplane our algorithm converged to.
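To get a feel for the rate, here is the right-hand side of the bound, together with the failure probability \(e^{-N\delta^2/2}\), evaluated at some illustrative values of \(N\), \(P\), and \(\delta\) (the numbers are only for illustration).

```python
import numpy as np

def ulln_bound(N, P, delta):
    """Right-hand side of the ULLN bound, and the probability it fails to hold."""
    return np.sqrt((P + 1) * np.log(N + 1) / N) + delta, np.exp(-N * delta**2 / 2)

for N in (1_000, 10_000, 100_000):
    bound, fail = ulln_bound(N, P=10, delta=0.05)
    print(f"N={N:>6}: test error <= {bound:.3f} with probability >= {1 - fail:.6f}")
```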
Via direct analysis
Theorem: \(\underbrace{\mathbb{P}_{\,}\left(y\ne\text{sign}(\boldsymbol{x}^T\hat{\boldsymbol{\beta}})\right)}_{\mathbb{E}_{\,}\left[\mathscr{L}(\hat{\boldsymbol{\beta}})\right]}\leq \frac{1}{N+1}\cdot\frac{3}{\gamma^2}\).
Comparison between ULLN and direct analysis:
- ULLN: with high probability, \(\mathscr{L}(\hat{\boldsymbol{\beta}})\lesssim \frac{1}{\sqrt{N}}\).
- Direct analysis: \(\mathbb{E}_{\,}\left[\mathscr{L}(\hat{\boldsymbol{\beta}})\right]\lesssim \frac{1}{N}\).
Without knowing anything about the separating hyperplane, besides the fact that it separates the training data, the ULLN approach shows that it generalizes well. We get a different, arguably stronger result by analyzing the particular separating hyperplane the algorithm converges to.
Proof of theorem
The proof utilizes the fact that the number of mistakes the algorithm makes during training is bounded, irrespective of the number of observations. That means that we can add more observations to the training dataset (in particular, the test observation) without changing much. We call this a “stability” argument.
Proof of theorem: Combine both train and test observations into a single dataset \(\mathcal{D}=\{(\boldsymbol{x}_1,y_1),\dots,(\boldsymbol{x}_N,y_N),(\boldsymbol{x},y)\}\), and train a perceptron \(\hat{\boldsymbol{\beta}}^{\mathcal{D}}\) on the combined dataset \(\mathcal{D}\). By the lemma, it makes at most \(\frac{3}{\gamma^2}\) margin mistakes on observations in \(\mathcal{D}\) throughout training, out of a total of \(N+1\) observations. How likely is it to have made a margin mistake on the test observation \((\boldsymbol{x},y)\)? Since the observations are IID, their joint distribution is permutation invariant, so it can be no more than \[ \mathbb{P}_{\,}\left(\text{mistake on }(\boldsymbol{x},y)\text{ during training}\right)\leq \frac{1}{N+1}\cdot\frac{3}{\gamma^2}, \] which is the number of margin mistakes divided by the number of observations.
To make this argument more formal - the subset of observations on which \(\hat{\boldsymbol{\beta}}^{\mathcal{D}}\) makes a margin mistake during training is a random subset of \(\{1,\dots,N+1\}\). If its size is less than \(\frac{3}{\gamma^2}\), add random observations until its size reaches \(\frac{3}{\gamma^2}\) (assume that number is an integer). The IID assumption implies exchangeability, which means that the random subset has a uniform distribution over all subsets of \(\{1,\dots,N+1\}\) of size \(\frac{3}{\gamma^2}\). What is the probability that \(N+1\) is in the random subset? Consider the equivalent setting where the subset is fixed at \(\{1,\dots,\frac{3}{\gamma^2}\}\), and the observations are shuffled. The probability that the \((N+1)\)’th observation lands in the subset is \(\frac{1}{N+1}\cdot\frac{3}{\gamma^2}\).
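The exchangeability step can be sanity-checked by simulation: for a uniformly random subset of fixed size \(k\) out of \(N+1\) indices, the chance that one particular index falls in the subset is \(k/(N+1)\). A minimal check, with arbitrary \(k\) and \(N\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, reps = 100, 30, 100_000          # here k plays the role of 3 / gamma^2
hits = sum(
    N in rng.choice(N + 1, size=k, replace=False)   # indices 0..N; "N" is the (N+1)'th one
    for _ in range(reps)
)
print(hits / reps, "vs", k / (N + 1))  # both close to 0.297
```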
Now for the key part - notice that if none of the margin mistakes made during training of \(\hat{\boldsymbol{\beta}}^{\mathcal{D}}\) were on \((\boldsymbol{x},y)\):
- Neither \(\boldsymbol{x}\) nor \(y\) plays any role in the algorithm
- So \((\boldsymbol{x},y)\) could have been dropped from the combined dataset \(\mathcal{D}\) without changing anything
- So \(\hat{\boldsymbol{\beta}}=\hat{\boldsymbol{\beta}}^{\mathcal{D}}\)
- So \(\hat{\boldsymbol{\beta}}\) also makes no mistake on \((\boldsymbol{x},y)\)
So, \[ \mathbb{P}_{\,}\left(y\ne\text{sign}(\boldsymbol{x}^T\hat{\boldsymbol{\beta}})\right) \leq \mathbb{P}_{\,}\left(\text{mistake on }(\boldsymbol{x},y)\text{ during training}\right) \leq \frac{1}{N+1}\cdot\frac{3}{\gamma^2}. \]
To make this argument more formal - consider \(\hat{\boldsymbol{\beta}}^{\text{dummy}}\), which is the output of a perceptron on \(\mathcal{D}\), with the tweak that when the random index is \(N+1\), we skip to the next iteration without checking the test observation \((\boldsymbol{x},y)\). Conditioned on the event \(A\) that \(\hat{\boldsymbol{\beta}}^{\mathcal{D}}\) made no mistake on the test observation during training, \(\hat{\boldsymbol{\beta}}^{\text{dummy}}=\hat{\boldsymbol{\beta}}^{\mathcal{D}}\). Now, it’s easy to see that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\beta}}^{\text{dummy}}\) have the same distribution.
The proof is done.
Interpretation as loss minimization
We’ll interpret the perceptron algorithm as running a form of stochastic gradient descent to minimize hinge loss.
Hinge loss:
For \(y\in\{-1,1\}\) and \(\hat{y}=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\in\mathbb{R}\), \[ \mathscr{L}(y, \hat{y}) = \max(1-y\hat{y}, 0) = \max(1-y\boldsymbol{x}^T\hat{\boldsymbol{\beta}}, 0). \] This loss encourages correct classification with a margin. If the observation was correctly classified (\(y=\text{sign}(\hat{y})\)), we either pay zero loss if the margin was large enough (\(y\hat{y}\geq 1\)), or else we pay a loss of up to one. If the observation was incorrectly classified, we pay a loss of at least one.
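A few evaluations of the hinge loss, showing the three regimes just described:

```python
import numpy as np

def hinge(y, y_hat):
    """Hinge loss max(1 - y * y_hat, 0), for y in {-1, +1} and real-valued y_hat."""
    return np.maximum(1 - y * y_hat, 0)

print(hinge(+1, 2.0))    # 0.0 : correctly classified with a large enough margin (y * y_hat >= 1)
print(hinge(+1, 0.4))    # 0.6 : correctly classified, but margin too small -> loss up to one
print(hinge(+1, -0.5))   # 1.5 : incorrectly classified -> loss of at least one
```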
Stochastic gradient descent:
Move in the direction of the negative gradient of the loss, \[ \nabla_{\boldsymbol{\beta}}\max(1-y_n\boldsymbol{x}_n^T\boldsymbol{\beta}, 0) = \begin{cases} 0 & \text{if } y_n\boldsymbol{x}_n^T\boldsymbol{\beta} \geq 1 \\ -y_n\boldsymbol{x}_n & \text{otherwise}. \end{cases} \] (At the kink \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}=1\) this is a subgradient rather than a gradient.) The stochastic part comes from taking the gradient of the loss on a single random observation, rather than averaging across all observations at every iteration.
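The correspondence can be spelled out in a few lines. Below is a sketch (our own, with a hypothetical step-size argument `lr`): a stochastic (sub)gradient step on the hinge loss of a single observation, with step size one, is exactly the perceptron update.

```python
import numpy as np

def hinge_subgradient(beta, x, y):
    """A (sub)gradient of max(1 - y x^T beta, 0) with respect to beta."""
    return np.zeros_like(beta) if y * x @ beta >= 1 else -y * x

def sgd_step(beta, x, y, lr=1.0):
    """One stochastic (sub)gradient step on the hinge loss of a single observation."""
    return beta - lr * hinge_subgradient(beta, x, y)

# With lr = 1 and a margin mistake, the step reproduces the perceptron update beta += y x.
beta, x, y = np.zeros(2), np.array([0.6, 0.8]), -1
print(sgd_step(beta, x, y))   # [-0.6 -0.8], i.e. exactly y * x
```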
Proof of lemma
Proof: Let \(\boldsymbol{\beta}_t\) denote the normal vector at the \(t\)’th iteration. If at iteration \(t\) the algorithm chose \(n\) and made a margin mistake, i.e. \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}_{t}<1\), then \[ \|\boldsymbol{\beta}_{t+1}\|_2^2 = \|\boldsymbol{\beta}_{t} + y_n\boldsymbol{x}_n\|_2^2 = \|\boldsymbol{\beta}_{t}\|_{2}^2 + 2y_n\boldsymbol{x}_n^T\boldsymbol{\beta}_t + \|\boldsymbol{x}_n\|_2^2 \leq \|\boldsymbol{\beta}_{t}\|_{2}^2 + 3. \] The observation here is that when there is a margin mistake, \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}_t\) is small (less than one), so the squared norm of the normal vector grows by at most a constant. If no mistake was made, the normal vector remains unchanged. Summing over mistakes, and recalling that \(\boldsymbol{\beta}_0=0\), we get that \[ \|\boldsymbol{\beta}_{t+1}\|_2^2 \leq 3m_t, \] where \(m_t\) is the number of mistakes made up to and including the \(t\)’th iteration. Now, consider the unit vector \(\boldsymbol{\beta}^*\) for which, by assumption, \(y_n\boldsymbol{x}_n^T\boldsymbol{\beta}^*>\gamma\) for all \(n\). If a mistake was made at the \(t\)’th iteration, \[ (\boldsymbol{\beta}_{t+1}-\boldsymbol{\beta}_t)^T\boldsymbol{\beta}^* = y_n\boldsymbol{x}_n^T\boldsymbol{\beta}^* \geq \gamma. \] Using the Cauchy–Schwarz inequality and a telescoping sum, \[ \|\boldsymbol{\beta}_{t+1}\|_2 \geq \boldsymbol{\beta}_{t+1}^T\boldsymbol{\beta}^* = \sum_{k=0}^t(\boldsymbol{\beta}_{k+1}-\boldsymbol{\beta}_{k})^T\boldsymbol{\beta}^* \geq m_t\gamma. \] We thus have \[ \gamma^2m_t^2\leq \|\boldsymbol{\beta}_{t+1}\|_2^2 \leq 3m_t, \] so \(m_t\leq\frac{3}{\gamma^2}\), and the proof is concluded.