Generalization and uniform laws
\[ % Generalization error \def\err{\textrm{Error}} \]
Goals
- Prove a ULLN for sufficiently smooth parametric function classes
- Introduce a covering argument to do so
Reading
This is a simplified exposition of Van der Vaart (2000) Example 19.7. One can find a different perspective on the same class of problems in Keener (2010) Chapter 9. A more advanced approach that is finite–sample but limited to bounded functions can be found in Wainwright (2019) Chapter 4.
Uniform laws of large numbers for smooth function classes
Recall that we want to find conditions under which a uniform law of large numbers (ULLN) holds:
\[ \begin{aligned} \left|\hat{\mathscr{L}}(\hat{f}) - \mathscr{L}(\hat{f})\right| \le{}& \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \\={}& \sup_{f\in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(x_n), y_n) - \mathbb{E}_{\mathrm{new}}\left[\mathscr{L}(f(x_\mathrm{new}), y_\mathrm{new})\right] \right|. \end{aligned} \]
Note that the ULLN depends on the function \(f\) only through \(f\mapsto \mathscr{L}(f(x), y)\). And so the above ULLN is a special case of the more general problem of controlling
\[ \sup_{f\in \mathcal{F}} \left|\frac{1}{N} \sum_{n=1}^N f(z_n) - \mathbb{E}\left[f(z)\right]\right| = \sup_{f\in \mathcal{F}} \err(f, N), \]
where we define
\[ \err(f,N) := \left|\frac{1}{N} \sum_{n=1}^N f(z_n) - \mathbb{E}\left[f(z)\right]\right|. \]
Note that \(\err(f,N)\) is random, since it depends on \(z_1, \ldots, z_N\).
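To see this randomness concretely, here is a minimal numerical sketch, assuming (purely for illustration) that \(z \sim \mathcal{N}(0,1)\) and \(f(z) = z^2\), so that \(\mathbb{E}\left[f(z)\right] = 1\) is known in closed form; the names and setup are our own, not part of the development above.

```python
import numpy as np

rng = np.random.default_rng(0)

def err(f, z, true_mean):
    # |(1/N) sum_n f(z_n) - E[f(z)]| for one sample z of size N.
    return abs(np.mean(f(z)) - true_mean)

f = lambda z: z**2  # E[f(z)] = 1 for z ~ N(0, 1)
for N in [10, 100, 1000, 10000]:
    draws = [err(f, rng.normal(size=N), true_mean=1.0) for _ in range(200)]
    print(f"N={N:6d}  average err(f, N) over 200 replications: {np.mean(draws):.4f}")
```

Averaged over replications, \(\err(f, N)\) shrinks as \(N\) grows, as the ordinary LLN predicts for a single fixed \(f\).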
Some examples
Although we will develop an abstract result, it’s worth keeping in mind a few parametric examples. The linear case is functions of the form \(\mathcal{F}= \{ f(x) = \boldsymbol{\beta}^\intercal x\textrm{ for some } \boldsymbol{\beta}\}\). An estimator of a class probability from logistic regression might be \(\mathcal{F}= \{ f(x) = \textrm{Logit}^{-1}(\boldsymbol{\beta}^\intercal x) \textrm{ for some } \boldsymbol{\beta}\}\). Similarly, a \(0\)–\(1\) valued classifier would be \(\mathcal{F}= \{ f(x) = \mathrm{I}\left(\boldsymbol{\beta}^\intercal x> \alpha\right) \textrm{ for some } \boldsymbol{\beta}, \alpha \}\). Finally, you might take \(\mathcal{F}= \{ \log p(\cdot | \theta) \textrm{ for some parameter } \theta \}\) for a probability distribution \(p(z| \theta)\), corresponding to maximum likelihood.
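For concreteness, each of these classes can be written in code as a parameter-indexed family of functions. The sketch below is illustrative only; the function names are our own.

```python
import numpy as np

def linear(beta):
    # F = { f(x) = beta^T x }
    return lambda x: x @ beta

def logistic_prob(beta):
    # F = { f(x) = Logit^{-1}(beta^T x) }
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ beta)))

def threshold_classifier(beta, alpha):
    # F = { f(x) = I(beta^T x > alpha) }, 0-1 valued
    return lambda x: (x @ beta > alpha).astype(float)

# One member of each class, evaluated at a random design point.
rng = np.random.default_rng(0)
x, beta = rng.normal(size=3), np.array([1.0, -0.5, 0.25])
print(linear(beta)(x), logistic_prob(beta)(x), threshold_classifier(beta, 0.0)(x))
```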
A finite set of functions
We first note that the problem is easy if \(\mathcal{F}\) is finite, i.e., \(\mathcal{F}= \{ f_k \textrm{ for }k=1,\ldots,K < \infty \}\), where each \(f_k\) satisfies \(\mathbb{E}\left[|f_k(z)|\right] < \infty\). In that case,
\[ \sup_{f\in \mathcal{F}} \err(f,N) = \max_{k=1,\ldots,K} \err(f_k, N). \]
Since \(\err(f_k, N) \rightarrow 0\) in probability by the ordinary LLN, it follows that \(\max_{k=1,\ldots,K} \err(f_k, N) \rightarrow 0\) in probability as well, since \(K\) is finite.
Specifically, \(\err(f_k, N) \rightarrow 0\) means that, for any \(\varepsilon > 0\) and \(\delta > 0\), there exists an \(N_k\) such that
\[ N > N_k \Rightarrow \mathbb{P}\left(\err(f_k, N) \ge \varepsilon\right) \le \delta. \]
Take \(N > \max_{k =1,\ldots,K} N_k\). Then
\[ \begin{aligned} \mathbb{P}\left(\max_{k=1,\ldots,K} \err(f_k, N) \ge \varepsilon \right) ={}& \mathbb{P}\left(\bigcup_{k=1}^K \left\{ \err(f_k, N) \ge \varepsilon \right\}\right) \\\le{}& \sum_{k=1}^K \mathbb{P}\left(\err(f_k, N) \ge \varepsilon\right) \\\le{}& K \delta. \end{aligned} \]
Since we can take \(\delta = \delta' / K\) for any \(\delta' > 0\), we have proven that \(\sup_{f\in \mathcal{F}} \err(f,N) \rightarrow 0\) in probability.
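The union bound above is easy to see in simulation. The sketch below uses an illustrative finite class (\(K = 3\) monomials of a standard normal, whose means are known exactly) and compares the estimated \(\mathbb{P}\left(\max_k \err(f_k, N) \ge \varepsilon\right)\) against the sum of the individual probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = [lambda z, p=p: z**p for p in (1, 2, 3)]  # finite class: K = 3 monomials
means = [0.0, 1.0, 0.0]                        # E[z^p] for z ~ N(0, 1)
eps, N, reps = 0.1, 2000, 500

# err(f_k, N) for each of `reps` independent samples of size N.
errs = np.empty((reps, len(fs)))
for r in range(reps):
    z = rng.normal(size=N)
    errs[r] = [abs(np.mean(f(z)) - m) for f, m in zip(fs, means)]

p_max = np.mean(errs.max(axis=1) >= eps)
p_union = sum(np.mean(errs[:, k] >= eps) for k in range(len(fs)))
print(f"P(max_k err >= eps) ~ {p_max:.3f} <= union bound ~ {p_union:.3f}")
```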
A covering
The previous trick cannot be applied when \(\mathcal{F}\) is infinite, as is the case in all of our examples: each is indexed by a continuous parameter and so contains infinitely many functions.
A classical approach is to find a finite “cover” of \(\mathcal{F}\). There are many ways to do so, but here we will focus on a particular example (Van der Vaart (2000) Example 19.7). Suppose that \(\mathcal{F}= \{ f(\cdot; \theta) \textrm{ for some }\theta \in \Omega \}\), and make the following assumption:
Assume that \(\Omega\) is compact and that, for each \(z\), \[ \left|f(z; \theta) - f(z; \theta')\right| \le{} M(z) \left\Vert\theta - \theta'\right\Vert, \] where \(M(z) \ge 0\) and \(\mathbb{E}\left[M(z)\right] < \infty\).
This means that the dependence of \(f(z; \theta)\) on its arguments can be decomposed into a smooth (Lipschitz) part depending on \(\theta\) and an integrable part depending on \(z\). When this is true, for a fixed \(\gamma > 0\), we can always find a set of \(K(\gamma)\) parameters \(\theta_k\) (depending on \(\gamma\)) such that
\[ \textrm{For each }\theta \in \Omega\textrm{, there exists }\theta_k \textrm{ such that }\left\Vert\theta_k - \theta\right\Vert < \gamma. \]
The smaller \(\gamma\) is, the larger \(K(\gamma)\) must be. That there exist such \(\theta_k\) is guaranteed by the compactness of \(\Omega\), which is a crucial assumption.
Recall that if \(\Omega\) is compact, then every open cover has a finite sub–cover. Using this fact, prove that \(K(\gamma) < \infty\) for each \(\gamma > 0\).
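One concrete (and non-optimal) construction, assuming \(\Omega = [0, 1]^d\): an axis-aligned grid of cell centers with \(m = \lceil \sqrt{d} / (2\gamma) \rceil\) points per axis places every \(\theta\) within \(\gamma\) of some grid point, so \(K(\gamma) \le m^d < \infty\). The sketch below counts the cover size as \(\gamma\) shrinks.

```python
import itertools
import numpy as np

def grid_cover(d, gamma):
    # Cell centers of a uniform grid on [0, 1]^d; every theta is within
    # sqrt(d)/(2m) <= gamma of some center (in Euclidean norm).
    m = int(np.ceil(np.sqrt(d) / (2 * gamma)))
    axis = (np.arange(m) + 0.5) / m
    return np.array(list(itertools.product(axis, repeat=d)))

for gamma in [0.5, 0.1, 0.05]:
    print(f"gamma = {gamma:4.2f}: K(gamma) = {len(grid_cover(2, gamma))}")
```

As expected, \(K(\gamma)\) grows rapidly as \(\gamma\) shrinks, but is finite for every fixed \(\gamma > 0\).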
Given this “covering,” for a given \(\theta\) and its associated \(\theta_k\) with \(\left\Vert\theta - \theta_k\right\Vert < \gamma\), we can identify \(f_k(\cdot) = f(\cdot; \theta_k)\).
Using the covering to produce a ULLN
Fix any \(f = f(\cdot; \theta) \in \mathcal{F}\), and let \(\theta_k\) be the covering point with \(\left\Vert\theta - \theta_k\right\Vert < \gamma\). Adding and subtracting \(f(\cdot; \theta_k)\) gives
\[ \begin{aligned} \err(f, N) ={}& \left|\frac{1}{N} \sum_{n=1}^N f(z_n; \theta) - \mathbb{E}\left[f(z; \theta)\right]\right| \\={}& \left|\frac{1}{N} \sum_{n=1}^N \left(f(z_n; \theta) - f(z_n; \theta_k)\right) - \mathbb{E}\left[f(z; \theta) - f(z; \theta_k)\right] + \frac{1}{N} \sum_{n=1}^N f(z_n; \theta_k) - \mathbb{E}\left[f(z; \theta_k)\right]\right| \\ \le{}& \frac{1}{N} \sum_{n=1}^N \left|f(z_n; \theta) - f(z_n; \theta_k)\right| + \mathbb{E}\left[\left|f(z; \theta) - f(z; \theta_k)\right|\right] + \err(f_k, N) \\ \le{}& \left( \frac{1}{N} \sum_{n=1}^N M(z_n) + \mathbb{E}\left[M(z)\right] \right) \left\Vert\theta - \theta_k\right\Vert + \err(f_k, N) \\ \le{}& \left( \frac{1}{N} \sum_{n=1}^N M(z_n) + \mathbb{E}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(f_k, N). \end{aligned} \]
The right hand side now does not depend on \(f\), so
\[ \sup_{f\in \mathcal{F}} \err(f, N) \le \left( \frac{1}{N} \sum_{n=1}^N M(z_n) + \mathbb{E}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(f_k, N). \]
For any fixed \(\gamma\), \(\max_{k=1,\ldots,K} \err(f_k, N) \rightarrow 0\) in probability, as shown above. Since \(\frac{1}{N} \sum_{n=1}^N M(z_n) \rightarrow \mathbb{E}\left[M(z)\right]\) by the ordinary LLN (recall \(\mathbb{E}\left[M(z)\right] < \infty\)), we have that, for any fixed \(\gamma\),
\[ \left( \frac{1}{N} \sum_{n=1}^N M(z_n) + \mathbb{E}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(f_k, N) \rightarrow 2 \mathbb{E}\left[M(z)\right] \gamma \quad\textrm{in probability}. \]
Since \(\gamma\) can be made arbitrarily small (at the cost of requiring an even larger \(N\) to make \(\max_{k=1,\ldots,K} \err(f_k, N)\) small), for any \(\varepsilon > 0\) and \(\delta > 0\) we can choose \(\gamma = \varepsilon / (4 \mathbb{E}\left[M(z)\right])\), so that the right-hand side converges in probability to \(\varepsilon / 2 < \varepsilon\), and then find an \(N\) large enough that
\[ \mathbb{P}\left(\sup_{f\in \mathcal{F}} \err(f, N) \ge \varepsilon\right) \le \delta. \]
It follows that \(\sup_{f\in \mathcal{F}} \err(f, N) \rightarrow 0\) in probability.
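The whole argument can be checked numerically. The sketch below assumes the illustrative family \(f(z; \theta) = \cos(\theta z)\) with \(\theta \in \Omega = [0, 1]\) and \(z \sim \mathcal{N}(0,1)\), which satisfies the Lipschitz condition with \(M(z) = |z|\) (since \(|\partial \cos(\theta z) / \partial \theta| = |z \sin(\theta z)| \le |z|\)) and has \(\mathbb{E}\left[f(z; \theta)\right] = e^{-\theta^2/2}\) in closed form; the supremum over \(\Omega\) is approximated by a fine grid of \(\theta\) values.

```python
import numpy as np

rng = np.random.default_rng(0)
thetas = np.linspace(0.0, 1.0, 500)  # fine grid standing in for Omega = [0, 1]

def sup_err(N):
    z = rng.normal(size=N)
    emp = np.cos(np.outer(z, thetas)).mean(axis=0)  # empirical means for every theta
    true = np.exp(-thetas**2 / 2)                   # E[cos(theta z)] in closed form
    return np.max(np.abs(emp - true))

for N in [10, 100, 1000, 10000]:
    print(f"N={N:6d}  sup over theta grid of err(f, N): {sup_err(N):.4f}")
```

The supremum over the whole family decays with \(N\), just as the ULLN guarantees.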
General covering arguments
Note that the above argument relied on having a covering of the parameter space to give a covering of the function space. The key idea is the covering, not the parameters! Many ULLNs for generic function classes proceed by studying the number of functions it takes to “cover” the function class \(\mathcal{F}\) using some notion of distance between functions, rather than the distance between parameters.