Generalization and uniform laws
\[
% Generalization error
\def\err{\textrm{Error}}
\]
Goals
- Prove a ULLN for sufficiently smooth parametric function classes
- Introduce a covering argument to do so
Reading
This is a simplified exposition of Van der Vaart (2000) Example 19.7. One can find a different perspective on the same class of problems in Keener (2010) Chapter 9. A more advanced approach that is finite-sample but limited to bounded functions can be found in Wainwright (2019) Chapter 4.
Uniform laws of large numbers for smooth function classes
Recall that we want to find conditions under which a uniform law of large numbers (ULLN) holds:
\[ \begin{aligned} \left|\hat{\mathscr{L}}(\hat{f}) - \mathscr{L}(\hat{f})\right| \le{}& \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \\={}& \sup_{f\in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(x_n), y_n) - \mathbb{E}_{new}\left[\mathscr{L}(f(x_\mathrm{new}), y_\mathrm{new})\right] \right|. \end{aligned} \]
Note that the ULLN depends on the function \(f\) only through \(f\mapsto \mathscr{L}(f(x), y)\). Given this, we can define
\[ \begin{aligned} z_n :={}& (x_n, y_n)\\ g(z_n; f) :={}& \mathscr{L}(f(x_n), y_n)\\ \mathcal{G}:={}& \{ g: g(z) = g(z; f) \textrm{ for some }f\in \mathcal{F}\}. \end{aligned} \]
With these transformations, we can rewrite our generalization error in the following more general form:
\[ \sup_{g\in \mathcal{G}} \left|\frac{1}{N} \sum_{n=1}^Ng(z_n) - \mathbb{E}_{\vphantom{}}\left[g(z)\right]\right| =\sup_{g\in \mathcal{G}} \err(g, N) \]
where we define
\[ \err(g,N) := \left|\frac{1}{N} \sum_{n=1}^Ng(z_n) - \mathbb{E}_{\vphantom{}}\left[g(z)\right]\right|. \]
Note that \(\err(g,N)\) is random, since it depends on \(z_1, \ldots, z_N\).
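To see this randomness concretely, here is a minimal numerical sketch in Python. The data-generating process (Gaussian features, a linear signal, unit-variance noise) and the particular fixed parameter \(\theta\) are assumptions made purely for illustration, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions (not part of the notes): squared loss at a fixed theta,
# with x ~ N(0, I_2) and y = x @ theta_true + N(0, 1) noise.
theta_true = np.array([1.0, -0.5])
theta = np.array([0.5, 0.5])  # the fixed parameter that defines this particular g

def g(x, y, theta):
    """g(z; theta) = squared loss of the linear predictor at theta."""
    return (x @ theta - y) ** 2

# For this data-generating process, E[g(z; theta)] has a closed form:
# E[(x @ (theta - theta_true) - noise)^2] = ||theta - theta_true||^2 + 1.
expected_g = np.sum((theta - theta_true) ** 2) + 1.0

def err(N):
    """One realization of Error(g, N) = |empirical mean of g - E[g(z)]|."""
    x = rng.normal(size=(N, 2))
    y = x @ theta_true + rng.normal(size=N)
    return abs(np.mean(g(x, y, theta)) - expected_g)

for N in [100, 1000, 10000]:
    # Five independent datasets of each size: Error(g, N) is random but shrinks with N.
    print(N, np.round([err(N) for _ in range(5)], 3))
```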
A finite set of functions
We first note that the problem is easy if \(\mathcal{G}\) is finite, i.e., \(\mathcal{G}= \{ g_k \textrm{ for }k=1,\ldots,K < \infty \}\), where each \(g_k\) satisfies \(\mathbb{E}_{\vphantom{}}\left[|g_k(z)|\right] < \infty\). In that case,
\[ \sup_{g\in \mathcal{G}} \err(g,N) = \max_{k=1,\ldots,K} \err(g_k, N). \]
Since \(\err(g_k, N) \rightarrow 0\) by the ordinary LLN, it follows that \(\max_{k=1,\ldots,K} \err(g_k, N) \rightarrow 0\) as well, since \(K\) is finite.
Specifically, \(\err(g_k, N) \rightarrow 0\) means that, for any \(\varepsilon > 0\) and \(\delta > 0\), there exists an \(N_k\) such that
\[ N > N_k \Rightarrow \mathbb{P}_{\,}\left(\err(g_k, N) \ge \varepsilon\right) \le \delta. \]
Take \(N > \max_{k =1,\ldots,K} N_k\). Then
\[ \begin{aligned} \mathbb{P}_{\,}\left(\max_{k=1,\ldots,K} \err(g_k, N) \ge \varepsilon \right) ={}& \mathbb{P}_{\,}\left(\bigcup_{k=1}^K \err(g_k, N) \ge \varepsilon\right) \\\le{}& \sum_{k=1}^K \mathbb{P}_{\,}\left(\err(g_k, N) \ge \varepsilon\right) \\\le{}& K \delta. \end{aligned} \]
Since we can take \(\delta = \delta' / K\) for any \(\delta' > 0\), we have proven that \(\sup_{g\in \mathcal{G}} \err(g,N) \rightarrow 0\) in probability.
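As a sanity check on this argument, the following sketch simulates an assumed toy class of \(K = 5\) functions \(g_k(z) = (z - \mu_k)^2\) with standard normal \(z\), estimates both the per-function probabilities \(\mathbb{P}_{\,}\left(\err(g_k, N) \ge \varepsilon\right)\) and the probability for the maximum, and confirms that the latter never exceeds the union bound.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative finite class (an assumption for this sketch): g_k(z) = (z - mu_k)^2
# with z ~ N(0, 1), so that E[g_k(z)] = 1 + mu_k^2 exactly.
mus = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])

def errors(N):
    """Error(g_k, N) for every k, computed on one fresh sample of size N."""
    z = rng.normal(size=N)
    empirical = np.mean((z[:, None] - mus[None, :]) ** 2, axis=0)
    return np.abs(empirical - (1.0 + mus ** 2))

N, eps, reps = 200, 0.25, 2000
sims = np.array([errors(N) for _ in range(reps)])       # shape (reps, K)
p_each = np.mean(sims >= eps, axis=0)                   # P(Error(g_k, N) >= eps)
p_max = np.mean(sims.max(axis=1) >= eps)                # P(max_k Error(g_k, N) >= eps)
print("per-function probabilities:", np.round(p_each, 3))
print("P(max >= eps) =", round(float(p_max), 3),
      "<= union bound =", round(float(p_each.sum()), 3))
```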
A covering
The previous trick cannot be applied when \(\mathcal{G}\) is infinite — as is the case in any of our examples, which are parameterized by a continuous parameter and so contain an infinite class of functions.
A classical approach is to find a finite “cover” of \(\mathcal{G}\). There are many ways to do so, but here we will focus on a particular example (Van der Vaart (2000) Example 19.7). Suppose that \(\mathcal{G}= \{ g(\cdot; \theta) \textrm{ for some }\theta \in \Omega \}\), and assume the following.
Assume that \(\Omega\) is compact, and assume that, for each \(z\), \[ \left|g(z; \theta) - g(z; \theta')\right| \le{} M(z) \left\Vert\theta - \theta'\right\Vert, \] where \(M(z) \ge 0\) and \(\mathbb{E}_{\vphantom{}}\left[M(z)\right] < \infty\). We will call this the integrable Lipschitz (IL) condition.
This means that the dependence of \(g(z; \theta)\) on \(\theta\) is Lipschitz, with a Lipschitz constant \(M(z)\) that depends on \(z\) but has finite expectation. When this is true, for a fixed \(\gamma > 0\), we can always find a set of \(K(\gamma)\) parameters \(\theta_k\) (depending on \(\gamma\)) such that
\[ \textrm{For each }\theta \in \Omega\textrm{, there exists }\theta_k \textrm{ such that }\left\Vert\theta_k - \theta\right\Vert < \gamma. \]
The smaller \(\gamma\) is, the larger \(K(\gamma)\) must be. That there exist such \(\theta_k\) is guaranteed by the compactness of \(\Omega\), which is a crucial assumption.
Recall that if \(\Omega\) is compact, then every open cover has a finite sub–cover. Using this fact, prove that \(K(\gamma) < \infty\) for each \(\gamma > 0\).
Given this “covering,” for a given \(\theta\) and its associated \(\theta_k\) with \(\left\Vert\theta - \theta_k\right\Vert < \gamma\), we can identify \(g_k(\cdot) = g(\cdot; \theta_k)\).
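To make the covering concrete, here is a minimal sketch that builds one explicit \(\gamma\)-cover (a regular grid of centers) of a compact box. The box \([0,1]^3\) and the grid construction are choices made for illustration; compactness only guarantees that some finite cover exists, not that this one is the smallest.

```python
import numpy as np
from itertools import product

def grid_cover(lower, upper, gamma):
    """Centers of a regular grid that gamma-covers the box [lower, upper] in R^d.

    Each cell's half-diagonal is at most gamma, so every theta in the box lies
    within Euclidean distance gamma of some center.
    """
    d = len(lower)
    h = 2.0 * gamma / np.sqrt(d)          # per-coordinate spacing achieving this
    axes = []
    for lo, hi in zip(lower, upper):
        n = int(np.ceil((hi - lo) / h))   # cells per coordinate
        axes.append(lo + (np.arange(n) + 0.5) * (hi - lo) / n)  # cell midpoints
    return np.array(list(product(*axes)))

lower, upper = np.zeros(3), np.ones(3)    # an illustrative compact Omega = [0, 1]^3
for gamma in [0.5, 0.25, 0.1, 0.05]:
    K = len(grid_cover(lower, upper, gamma))
    print(f"gamma = {gamma:>4}: K(gamma) = {K}")
# Smaller gamma means a larger (but always finite) K(gamma), roughly like gamma^(-d).
```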
Using the covering to produce a ULLN
For any \(g(\cdot) = g(\cdot; \theta) \in \mathcal{G}\), let \(\theta_k\) denote the member of the cover with \(\left\Vert\theta - \theta_k\right\Vert < \gamma\). Then
\[ \begin{aligned} \err(g, N) ={}& \left|\frac{1}{N} \sum_{n=1}^N g(z_n; \theta) - \mathbb{E}_{\vphantom{}}\left[g(z; \theta)\right]\right| \\={}& \left|\frac{1}{N} \sum_{n=1}^N \left( g(z_n; \theta) - g(z_n; \theta_k) \right) - \mathbb{E}_{\vphantom{}}\left[g(z; \theta) - g(z; \theta_k)\right] + \frac{1}{N} \sum_{n=1}^N g(z_n; \theta_k) - \mathbb{E}_{\vphantom{}}\left[g(z; \theta_k)\right]\right| \\ \le{}& \frac{1}{N} \sum_{n=1}^N\left|g(z_n; \theta) - g(z_n; \theta_k)\right| + \mathbb{E}_{\vphantom{}}\left[\left|g(z; \theta) - g(z; \theta_k)\right|\right] + \err(g_k, N) \\ \le{}& \left( \frac{1}{N} \sum_{n=1}^NM(z_n) + \mathbb{E}_{\vphantom{}}\left[M(z)\right] \right) \left\Vert\theta - \theta_k\right\Vert + \err(g_k, N) \\ \le{}& \left( \frac{1}{N} \sum_{n=1}^NM(z_n) + \mathbb{E}_{\vphantom{}}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(g_k, N). \end{aligned} \]
The right hand side now does not depend on \(g\), so
\[ \sup_{g\in \mathcal{G}} \err(g, N) \le \left( \frac{1}{N} \sum_{n=1}^NM(z_n) + \mathbb{E}_{\vphantom{}}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(g_k, N). \]
For any fixed \(\gamma\), \(\max_{k=1,\ldots,K} \err(g_k, N) \rightarrow 0\) in probability, as shown above. Since \(\frac{1}{N} \sum_{n=1}^NM(z_n) \rightarrow \mathbb{E}_{\vphantom{}}\left[M(z)\right]\) by the ordinary LLN, we have that, for any fixed \(\gamma\),
\[ \left( \frac{1}{N} \sum_{n=1}^NM(z_n) + \mathbb{E}_{\vphantom{}}\left[M(z)\right] \right) \gamma + \max_{k=1,\ldots,K} \err(g_k, N) \rightarrow 2 \mathbb{E}_{\vphantom{}}\left[M(z)\right] \gamma. \]
Since \(\gamma\) can be made arbitrarily small (at the cost of requiring an even larger \(N\) to make \(\max_{k=1,\ldots,K} \err(g_k, N)\) small), for any \(\varepsilon > 0\) and \(\delta > 0\) we can choose \(\gamma = \varepsilon / (4 \mathbb{E}_{\vphantom{}}\left[M(z)\right])\), so that the limit above equals \(\varepsilon / 2 < \varepsilon\), and then find an \(N\) large enough that
\[ \mathbb{P}_{\,}\left(\sup_{g\in \mathcal{G}} \err(g, N) \ge \varepsilon\right) \le \delta. \]
It follows that \(\sup_{g\in \mathcal{G}} \err(g, N) \rightarrow 0\) in probability.
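The following sketch illustrates this conclusion numerically for the squared-loss class used in the OLS example below, with the supremum over a compact \(\Omega = [-2, 2]^2\) approximated by a fine grid of parameters. The Gaussian data-generating process is an assumption made only so that \(\mathbb{E}_{\vphantom{}}\left[g(z; \theta)\right]\) has a closed form.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Illustrative assumptions: squared loss g(z; theta) = (x @ theta - y)^2,
# x ~ N(0, I_2), y = x @ theta_true + N(0, 1) noise, Omega = [-2, 2]^2.
theta_true = np.array([1.0, -0.5])

# Approximate the supremum over Omega with a fine grid of parameters.
grid_1d = np.linspace(-2.0, 2.0, 41)
thetas = np.array(list(product(grid_1d, grid_1d)))        # shape (K, 2)

# Closed form for this data-generating process:
# E[g(z; theta)] = ||theta - theta_true||^2 + 1.
expected = np.sum((thetas - theta_true) ** 2, axis=1) + 1.0

def sup_err(N):
    """Max over the parameter grid of |empirical mean of g - E[g(z; theta)]|."""
    x = rng.normal(size=(N, 2))
    y = x @ theta_true + rng.normal(size=N)
    errs = [abs(np.mean((x @ th - y) ** 2) - e) for th, e in zip(thetas, expected)]
    return max(errs)

for N in [100, 1000, 10000, 100000]:
    print(N, round(sup_err(N), 4))   # the (approximate) supremum over Omega -> 0
```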
General covering arguments
Note that the above argument relied on having a covering of the parameter space to give a covering of the function space. The key idea is the covering, not the parameters! Many ULLNs for generic function classes proceed by studying the number of functions it takes to “cover” the function class \(\mathcal{G}\) using some notion of distance between functions, rather than the distance between parameters.
Proving the Integrable Lipschitz condition
It may seem mysterious how to show that a problem satisfies the IL condition directly. An easy way is to bound the derivative with respect to \(\theta\) in terms of a function of \(z\) with finite expectation. Specifically,
Assume that \(\Omega\) is compact. Assume that, for each \(z\), \[ \frac{\partial g(z; \theta)}{\partial \theta} \textrm{ exists and } \left\Vert\frac{\partial g(z; \theta)}{\partial \theta}\right\Vert_2 \le M(z) \] where \(M(z) \ge 0\) and \(\mathbb{E}_{\vphantom{}}\left[M(z)\right] < \infty\). Then \(g(z; \theta)\) satisfies the integrable Lipschitz condition.
The proof goes via the fundamental theorem of calculus. For any \(\theta\) and \(\theta'\) in \(\Omega\), let \(\theta(t) = \theta' t+ (1 - t) \theta\) for \(t\in [0,1]\) denote a parameterized path from \(\theta\) to \(\theta'\) (we assume this segment stays in \(\Omega\), as it does whenever \(\Omega\) is convex). Then, for any \(z\),
\[ \begin{aligned} \left|g(z; \theta') - g(z; \theta)\right| ={}& \left|g(z; \theta(1)) - g(z; \theta(0))\right| \\={}& \left|\int_{0}^1 \frac{\partial g(z; \theta(t))}{\partial t} dt\right| & \textrm{(by the fundamental theorem of calculus)} \\={}& \left|\int_{0}^1 \left. \frac{\partial g(z; \theta)}{\partial \theta^\intercal} \right|_{\theta = \theta(t)} \frac{\partial \theta(t)}{\partial t} dt\right| & \textrm{(by the chain rule)} \\={}& \left|\left( \int_{0}^1 \left. \frac{\partial g(z; \theta)}{\partial \theta^\intercal} \right|_{\theta = \theta(t)} dt\right) (\theta' - \theta)\right| & \textrm{(by the definition of $\theta(t)$)} \\\le{}& \sup_{t\in [0,1]} \left\Vert \left. \frac{\partial g(z; \theta)}{\partial \theta} \right|_{\theta = \theta(t)} \right\Vert_2 \left\Vert\theta' - \theta\right\Vert_2 & \textrm{(by Cauchy-Schwarz)}\\\le{}& M(z) \left\Vert\theta' - \theta\right\Vert_2. & \textrm{(by assumption)} \end{aligned} \]
Often it’s easier to show that the norm of the partial derivative is bounded by a function of \(z\) (that is, of \(x\) and \(y\)) with finite expectation than to prove the stronger IL condition.
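As a numeric check of this argument, the following sketch verifies both the path-integral identity and the resulting bound. The loss is the squared loss from the OLS example below, and the particular draws of \(z = (x, y)\) and of the two parameters are random choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Squared loss (as in the OLS example below) and its gradient in theta.
def g(x, y, theta):
    return (x @ theta - y) ** 2

def grad_g(x, y, theta):
    return 2.0 * (x @ theta - y) * x

# Random illustrative draws of z = (x, y) and of two parameters theta, theta'.
x, y = rng.normal(size=2), rng.normal()
theta, theta_prime = rng.normal(size=2), rng.normal(size=2)

# Path theta(t) = t * theta' + (1 - t) * theta, evaluated at midpoints of [0, 1].
ts = (np.arange(1000) + 0.5) / 1000
path = ts[:, None] * theta_prime + (1.0 - ts)[:, None] * theta
directional = np.array([grad_g(x, y, th) @ (theta_prime - theta) for th in path])

# Fundamental theorem of calculus: the difference equals the path integral of the
# directional derivative (approximated here by the midpoint rule).
lhs = g(x, y, theta_prime) - g(x, y, theta)
print("difference:", round(float(lhs), 6),
      " path integral:", round(float(directional.mean()), 6))

# Cauchy-Schwarz step: |difference| is at most the largest gradient norm along
# the path times ||theta' - theta||_2.
max_grad = max(np.linalg.norm(grad_g(x, y, th)) for th in path)
bound = max_grad * np.linalg.norm(theta_prime - theta)
print("|difference| <= bound:", abs(lhs) <= bound + 1e-8)
```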
OLS loss example
For example, taking squared loss and linear regression, we get
\[ g(z, \theta) = \mathscr{L}(f(\boldsymbol{x}), y) = (\boldsymbol{x}^\intercal\theta - y)^2 = y^2 - 2 y\boldsymbol{x}^\intercal\theta + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\theta \theta^\intercal\right). \]
Let us assume that \(\Omega\) is compact. (One can show that \(\Omega\) can be safely taken to be compact with high probability as \(N\) grows, but I will leave that as a potential homework problem.)
Recall that you show directly in the homework that this loss obeys a ULLN. We can also verify the IL condition directly using the triangle inequality and Cauchy–Schwarz:
\[ \begin{aligned} g(z, \theta) - g(z, \theta') ={}& - 2 y\boldsymbol{x}^\intercal(\theta - \theta') + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal(\theta \theta^\intercal- \theta' {\theta'}^\intercal)\right). \\={}& - 2 y\boldsymbol{x}^\intercal(\theta - \theta') + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal(\theta + \theta') (\theta - \theta')^\intercal\right) \quad\Rightarrow\\ \left|g(z, \theta) - g(z, \theta')\right| \le{}& 2 \left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2 \left\Vert\theta - \theta'\right\Vert_2 + \left\Vert\boldsymbol{x}\right\Vert^2_2 \left\Vert\theta + \theta'\right\Vert_2 \left\Vert\theta - \theta'\right\Vert_2 \\\le{}& \left( 2 \left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2 + \left\Vert\boldsymbol{x}\right\Vert^2_2 \left\Vert\theta + \theta'\right\Vert_2 \right) \left\Vert\theta - \theta'\right\Vert_2 \\\le{}& 2 \left( \left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2 + \left\Vert\boldsymbol{x}\right\Vert^2_2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert_2 \right) \left\Vert\theta - \theta'\right\Vert_2 \end{aligned} \]
where \(\sup_{\theta \in \Omega} \left\Vert\theta\right\Vert_2\) is finite because \(\Omega\) is compact, and the operator norm of \(\boldsymbol{x}\boldsymbol{x}^\intercal\) is equal to \(\left\Vert\boldsymbol{x}\right\Vert_2^2\). We can thus take
\[ M(z) = 2 \left( \left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2 + \left\Vert\boldsymbol{x}\right\Vert^2_2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert_2 \right), \]
and see that quadratic loss satisfies the IL condition, and so obeys a ULLN, under standard moment conditions on \(\boldsymbol{x}\) and \(y\) (it suffices that \(\mathbb{E}_{\vphantom{}}\left[\left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2\right] < \infty\) and \(\mathbb{E}_{\vphantom{}}\left[\left\Vert\boldsymbol{x}\right\Vert_2^2\right] < \infty\)).
Alternatively, we can prove the IL condition by simply taking the derivative,
\[ \begin{aligned} \left\Vert\frac{\partial g(z, \theta)}{\partial \theta}\right\Vert_2 ={}& \left\Vert-2 y\boldsymbol{x}+ 2 \boldsymbol{x}\boldsymbol{x}^\intercal\theta\right\Vert_2 \\\le{}& 2 \left|y\right| \left\Vert\boldsymbol{x}\right\Vert_2 + 2 \left\Vert\boldsymbol{x}\right\Vert_2^2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert_2, \end{aligned} \] which in this case gives the same bound as the more tedious computation above.
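Finally, as a numerical sanity check (again under illustrative Gaussian draws of \(z\), and with \(\Omega\) taken to be a ball of radius 2), one can confirm that \(M(z)\) from above really is a Lipschitz constant: over many random pairs \(\theta, \theta' \in \Omega\), the ratio \(\left|g(z, \theta) - g(z, \theta')\right| / \left(M(z) \left\Vert\theta - \theta'\right\Vert_2\right)\) never exceeds one.

```python
import numpy as np

rng = np.random.default_rng(4)

radius = 2.0        # illustrative compact Omega = {theta : ||theta||_2 <= radius}
sup_theta = radius  # sup over Omega of ||theta||_2

def g(x, y, theta):
    return (x @ theta - y) ** 2

def M(x, y):
    # M(z) = 2 (|y| ||x||_2 + ||x||_2^2 * sup ||theta||_2), as derived above.
    nx = np.linalg.norm(x)
    return 2.0 * (abs(y) * nx + nx ** 2 * sup_theta)

def random_theta(d=3):
    # Some point inside Omega (the sampling distribution is irrelevant here).
    v = rng.normal(size=d)
    return radius * rng.uniform() * v / np.linalg.norm(v)

worst_ratio = 0.0
for _ in range(10000):
    x, y = rng.normal(size=3), rng.normal()
    th, th_prime = random_theta(), random_theta()
    lhs = abs(g(x, y, th) - g(x, y, th_prime))
    rhs = M(x, y) * np.linalg.norm(th - th_prime)
    if rhs > 0:
        worst_ratio = max(worst_ratio, lhs / rhs)

print("largest observed |g(z,theta) - g(z,theta')| / (M(z) ||theta - theta'||):",
      round(worst_ratio, 4))   # stays below 1, consistent with the IL condition
```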