Goals
- Give an example of the integrable Lipschitz conditions
- Show that zero–one loss does not satisfy the integrable Lipschitz conditions
- Define the Vapnik-Chervonenkis (VC) dimension
- State (but do not prove) a ULLN in terms of the VC dimension
Reading
These notes can be supplemented with Hastie, Tibshirani, and Friedman (2009) Chapter 7.9. The final bound is based on Wainwright (2019) Chapter 4.3, which is more advanced but provides a more complete treatment. Finally, if you want to go further, there are nice video tutorials in lectures 7 and 8 of this online course.
Examples of integrable Lipschitz loss
Recall that we proved a ULLN for functions of the form $f(x, \theta)$ satisfying
\[
|f(x, \theta_1) - f(x, \theta_2)| \le M(x) \, \|\theta_1 - \theta_2\| \quad \textrm{for all } \theta_1, \theta_2 \in \Theta,
\]
where $\mathbb{E}[M(X)] < \infty$ and $\Theta$ is compact. We called these the ``integrable Lipschitz'' (IL) conditions. This is because a Lipschitz function of $\theta$ satisfies
\[
|f(\theta_1) - f(\theta_2)| \le C \, \|\theta_1 - \theta_2\|
\]
for some constant $C$, and the IL condition allows this constant to depend on $x$, but in a way that is controlled (``dominated'') by an integrable function $M(x)$.
To use this result to control generalization error, we want to apply it to functions of the form
\[
f(z, \theta) = \mathcal{L}(y, g(x, \theta)), \quad \textrm{where } z = (x, y)
\]
plays the role of $x$ in the IL conditions. So we might take data $z = (x, y)$, a parametric learner $g(x, \theta)$, and a loss function $\mathcal{L}(y, \hat{y})$.
OLS loss
For example, taking squared loss and linear regression, we get
\[
f(z, \theta) = (y - \theta^\top x)^2.
\]
Let us assume that $\Theta$ is compact. (One can show that the estimator lies in a compact set with high probability as $N$ grows, so that $\Theta$ can safely be taken to be compact, but I will leave that as a potential homework problem.) We then have, by the triangle inequality and Cauchy--Schwarz,
\[
|f(z, \theta_1) - f(z, \theta_2)|
= \left| (\theta_2 - \theta_1)^\top x \right| \, \left| 2y - (\theta_1 + \theta_2)^\top x \right|
\le \|x\| \left( 2 |y| + 2 \sup_{\theta \in \Theta} \|\theta\| \, \|x\| \right) \|\theta_1 - \theta_2\|,
\]
where $\sup_{\theta \in \Theta} \|\theta\|$ is finite because $\Theta$ is compact. We can thus take
\[
M(z) = \|x\| \left( 2 |y| + 2 \sup_{\theta \in \Theta} \|\theta\| \, \|x\| \right)
\]
and see that quadratic loss satisfies the IL condition, and so satisfies a ULLN, provided $\mathbb{E}[\|X\|^2]$ and $\mathbb{E}[|Y| \, \|X\|]$ are finite. (Though recall that we can show directly that this loss obeys a ULLN, as you have done in your homework.)
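As a sanity check on this algebra, here is a minimal numerical sketch (Python, with made-up helper names and randomly generated data, not part of the course code) that verifies the bound $|f(z, \theta_1) - f(z, \theta_2)| \le M(z) \|\theta_1 - \theta_2\|$ on random draws, with $\Theta$ taken to be a ball of a chosen radius:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_loss(x, y, theta):
    """Squared loss of a linear predictor: (y - theta^T x)^2."""
    return (y - x @ theta) ** 2

def envelope(x, y, radius):
    """Candidate M(z) = ||x|| (2|y| + 2 r ||x||) for Theta a ball of radius r."""
    norm_x = np.linalg.norm(x)
    return norm_x * (2 * abs(y) + 2 * radius * norm_x)

radius, dim = 5.0, 3
for _ in range(10_000):
    x, y = rng.normal(size=dim), rng.normal()
    # Draw two parameters from the ball of radius `radius`.
    thetas = rng.uniform(-1.0, 1.0, size=(2, dim)) * radius / np.sqrt(dim)
    lhs = abs(sq_loss(x, y, thetas[0]) - sq_loss(x, y, thetas[1]))
    rhs = envelope(x, y, radius) * np.linalg.norm(thetas[0] - thetas[1])
    assert lhs <= rhs + 1e-10

print("The IL bound held on every random draw.")
```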
Bounded derivatives
In general, it suffices to show that the derivative $\nabla_\theta f(x, \theta)$ is dominated by an integrable $M(x)$. This is because, for any $\theta_1$ and $\theta_2$, we can define a path $\theta(t) = \theta_1 + t (\theta_2 - \theta_1)$ for $t \in [0, 1]$ (assuming $\Theta$ is convex, so the path stays in $\Theta$) and write
\[
f(x, \theta_2) - f(x, \theta_1) = \int_0^1 \nabla_\theta f(x, \theta(t))^\top (\theta_2 - \theta_1) \, dt,
\]
giving
\[
|f(x, \theta_2) - f(x, \theta_1)| \le \sup_{\theta \in \Theta} \left\| \nabla_\theta f(x, \theta) \right\| \, \|\theta_2 - \theta_1\|.
\]
Therefore, if we can take
\[
M(x) = \sup_{\theta \in \Theta} \left\| \nabla_\theta f(x, \theta) \right\|
\]
and show that $\mathbb{E}[M(X)] < \infty$, then the IL property follows.
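For the squared loss above, $\nabla_\theta f(z, \theta) = -2 (y - \theta^\top x) x$, so on a ball of radius $r$ the derivative criterion gives the envelope $2 (|y| + r \|x\|) \|x\|$, matching what we found directly. The sketch below (hypothetical Python, random search over the ball) checks that bound numerically:

```python
import numpy as np

rng = np.random.default_rng(1)

radius, dim = 5.0, 3
x, y = rng.normal(size=dim), rng.normal()

# Random search over the ball ||theta|| <= radius to approximate
# M(z) = sup_theta ||grad_theta f(z, theta)|| for f(z, theta) = (y - theta^T x)^2.
directions = rng.normal(size=(100_000, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
thetas = directions * rng.uniform(0.0, radius, size=(100_000, 1))
# grad_theta f(z, theta) = -2 (y - theta^T x) x, so its norm is 2 |y - theta^T x| ||x||.
grad_norms = 2 * np.abs(y - thetas @ x) * np.linalg.norm(x)

# Analytic envelope from the derivative criterion: 2 (|y| + radius ||x||) ||x||.
envelope = 2 * (abs(y) + radius * np.linalg.norm(x)) * np.linalg.norm(x)
print(f"max gradient norm found: {grad_norms.max():.3f} <= envelope: {envelope:.3f}")
assert grad_norms.max() <= envelope + 1e-10
```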
Zero–one loss functions
Unfortunately, zero–one losses are not typically IL, since they depend non–smoothly on the classification function $g(x, \theta)$. For a simple example, take $x \in \mathbb{R}$, $y \in \{0, 1\}$ (so it's a classification problem), and $\theta = \tau \in \mathbb{R}$. Consider the simple threshold classifier
\[
g(x, \tau) = 1\{x \ge \tau\}.
\]
Then consider the classification error at the point $(x, y) = (0, 1)$ for small $\epsilon > 0$ and thresholds $\tau = \pm \epsilon$. We have:
\[
\mathcal{L}(1, g(0, -\epsilon)) = 1\{g(0, -\epsilon) \ne 1\} = 0
\quad \textrm{but} \quad
\mathcal{L}(1, g(0, \epsilon)) = 1\{g(0, \epsilon) \ne 1\} = 1.
\]
Since $f(z, \tau)$ is discontinuous in $\tau$ at $\tau = 0$, the function cannot be IL (an IL function with finite $M(z)$ is necessarily continuous in $\theta$). A different technique is needed to control generalization error for zero–one loss.
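To see the failure concretely, here is a tiny sketch (assuming the threshold classifier $g(x, \tau) = 1\{x \ge \tau\}$ above): however small we make $\epsilon$, the loss at $(x, y) = (0, 1)$ jumps by a full unit between $\tau = -\epsilon$ and $\tau = \epsilon$, so the implied Lipschitz constant blows up.

```python
def zero_one_loss(y, x, tau):
    """Zero-one loss of the threshold classifier g(x, tau) = 1{x >= tau}."""
    prediction = int(x >= tau)
    return int(prediction != y)

x, y = 0.0, 1
for eps in [1.0, 0.1, 0.01, 0.001]:
    jump = abs(zero_one_loss(y, x, eps) - zero_one_loss(y, x, -eps))
    # The loss changes by a full unit no matter how close the two thresholds are,
    # so any Lipschitz "constant" would have to exceed 1 / (2 eps), which diverges.
    print(f"eps = {eps:6.3f}   |loss difference| = {jump}   ratio = {jump / (2 * eps):8.1f}")
```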
VC dimension
A key tool in controlling generalization error for functions taking values in $\{0, 1\}$ is the ``Vapnik–Chervonenkis'' (VC) dimension. The VC dimension is a measure of the complexity of a set of functions, so we might write $\nu(\mathcal{F})$ for the VC dimension of the function class $\mathcal{F}$, where each member $f \in \mathcal{F}$ maps some domain $\mathcal{X}$ to binary values $\{0, 1\}$. (The VC dimension can be defined for more general functions by reducing them to binary–valued functions; see Van der Vaart (2000) Chapter 19, for example.)
For a given set of $n$ points, $x_1, \ldots, x_n$, a function $f \in \mathcal{F}$ returns a binary vector $(f(x_1), \ldots, f(x_n)) \in \{0, 1\}^n$. As $f$ ranges over $\mathcal{F}$, this vector ranges over some subset of the $2^n$ possible binary vectors.
For example, take our threshold classifier $g(x, \tau) = 1\{x \ge \tau\}$ from above. If $x_1 < x_2$ and $n = 2$, then as $\tau$ ranges from $-\infty$ to $\infty$, the vector $(g(x_1, \tau), g(x_2, \tau))$ goes from $(1, 1)$ to $(0, 1)$ (as $\tau$ passes $x_1$) to $(0, 0)$ (as $\tau$ passes $x_2$). The value $(1, 0)$ is impossible to produce with any $\tau$ and this $(x_1, x_2)$, since $x_1 < x_2$, and so $x_2$ is classified as $1$ whenever $x_1$ is.
As we saw, for a given $(x_1, \ldots, x_n)$, the number of distinct values of $(f(x_1), \ldots, f(x_n))$ may be smaller than the total number of possible values, $2^n$. If $(f(x_1), \ldots, f(x_n))$ is able to take all $2^n$ possible values as $f$ ranges over $\mathcal{F}$, then $\mathcal{F}$ is said to ``shatter'' $(x_1, \ldots, x_n)$. Intuitively, $(x_1, \ldots, x_n)$ gets harder to shatter as it contains more and more points. For example, the threshold classifier above can shatter a single point, but cannot shatter two points. Also intuitively, more expressive function classes can shatter larger and larger sets of points.
For a given function class $\mathcal{F}$ of functions taking values in $\{0, 1\}$, let $\nu(\mathcal{F})$ be the largest number $n$ such that there exists a vector of points of length $n$ that can be shattered by $\mathcal{F}$. Equivalently, if $m$ is the smallest number such that no set of $m$ points can be shattered by $\mathcal{F}$, then $\nu(\mathcal{F}) = m - 1$.
Note that the VC dimension doesn't mean you can shatter all sets of size $\nu(\mathcal{F})$. In fact, this is obviously impossible, since a set consisting of two or more copies of the same value cannot be shattered, no matter how complex $\mathcal{F}$ is. The VC dimension requires only that you can find some set of size $\nu(\mathcal{F})$ that is shattered, but no such set of size $\nu(\mathcal{F}) + 1$.
We have already shown that the VC dimension of the threshold classifier is $1$.
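Here is a brute–force check of that claim (a Python sketch, assuming the class of one–dimensional threshold classifiers $1\{x \ge \tau\}$ above): enumerate the labelings achievable on a point set and compare the count to $2^n$.

```python
def achievable_labelings(points, thresholds):
    """All labelings of `points` produced by g(x, tau) = 1{x >= tau} as tau varies."""
    return {tuple(int(x >= tau) for x in points) for tau in thresholds}

def is_shattered(points):
    """Check whether the threshold class shatters `points`.

    The labeling only changes when tau crosses a point, so it suffices to try
    one threshold below all points, one in each gap between sorted points, and
    one above all points.
    """
    pts = sorted(points)
    candidates = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    return len(achievable_labelings(points, candidates)) == 2 ** len(points)

print(is_shattered([0.0]))       # True: a single point can be shattered
print(is_shattered([0.0, 1.0]))  # False: the labeling (1, 0) is never produced
```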
The passage from VC dimension to ULLNs is somewhat complex, and beyond the scope of this (undergraduate) course. An end–to–end treatment can be found in Wainwright (2019) chapter four, from which we simply state the result.
Let $\mathcal{F}$ have finite VC dimension $\nu(\mathcal{F})$, and let each $f \in \mathcal{F}$ take values in $\{0, 1\}$. Then, by Wainwright (2019), there is a universal constant $c$ such that, for any $\delta > 0$ and i.i.d. sample $x_1, \ldots, x_N$,
\[
\sup_{f \in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^{N} f(x_n) - \mathbb{E}[f(X)] \right|
\le c \sqrt{\frac{\nu(\mathcal{F}) \log(N + 1)}{N}} + \delta
\quad \textrm{with probability at least } 1 - e^{-N \delta^2 / 2}.
\]
Specifically, this is a consequence of the following results from Wainwright (2019):
- Proposition 4.18 and Definition 4.13 (VC dimension implies polynomial discrimination)
- Lemma 4.14 and equation 4.24 (polynomial discrimination controls Rademacher complexity)
- Theorem 4.10 (Rademacher complexity controls the ULLN)
Note that this result is stronger than our asymptotic results, in the sense that it is a finite–sample exact bound, rather than an (otherwise uncontrolled) guarantee as $N \to \infty$. Finite–sample bounds are much easier to obtain when the function $f(z, \theta)$ is bounded as a function of $z$, as is the case for zero–one loss but not typically for regression losses.
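The simulation below is a small illustration of this bound (a Python sketch, assuming $X \sim \mathcal{N}(0, 1)$ and the threshold class above, for which $\mathbb{E}[1\{X \ge \tau\}]$ is known exactly): the worst–case gap between empirical and population classification frequencies shrinks with $N$ and stays below the $\sqrt{\nu(\mathcal{F}) \log(N + 1) / N}$ rate appearing in the bound.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def worst_case_gap(x):
    """sup over tau of |empirical mean of 1{X >= tau} - P(X >= tau)| for X ~ N(0, 1).

    Between data points the empirical mean is constant and P(X >= tau) is
    monotone, so the supremum is attained just below or just above a data point.
    """
    x = np.sort(x)
    n = len(x)
    emp_below = (n - np.arange(n)) / n      # empirical mean for tau just below x[i]
    emp_above = (n - np.arange(n) - 1) / n  # empirical mean for tau just above x[i]
    pop = norm.sf(x)                        # P(X >= x[i]) for standard normal X
    return max(np.abs(emp_below - pop).max(), np.abs(emp_above - pop).max())

nu = 1  # VC dimension of the threshold class
for n in [100, 1_000, 10_000]:
    gaps = [worst_case_gap(rng.normal(size=n)) for _ in range(200)]
    rate = np.sqrt(nu * np.log(n + 1) / n)
    print(f"N = {n:6d}   mean sup gap = {np.mean(gaps):.4f}   sqrt(nu log(N+1)/N) = {rate:.4f}")
```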
Exercise: Show that if a function class of classifiers $\{x \mapsto g(x, \theta) : \theta \in \Theta\}$ taking values in $\{0, 1\}$ has finite VC dimension, then the function class of zero–one losses $\{(x, y) \mapsto 1\{g(x, \theta) \ne y\} : \theta \in \Theta\}$ also has finite VC dimension. Conclude that a ULLN holds for zero–one loss when using a classifier with finite VC dimension.
References
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.
Van der Vaart, A. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
Wainwright, M. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48. Cambridge University Press.