Regret decomposition

Author

Ryan Giordano

Regret decomposition

Suppose we have a function class \(\mathcal{F}\), which is a strict subset of all possible functions, and a risk function \(\mathscr{R}(\cdot)\) defined on functions. For example, we have considered the special case

\[ \mathcal{F}_{lin} = \{ f: f(\boldsymbol{x}) = \boldsymbol{x}^\intercal\boldsymbol{\beta}\textrm{ for some }\boldsymbol{\beta}\in \mathbb{R}^{P} \} \quad\textrm{and}\quad \mathscr{R}(\cdot) = \mathbb{E}_{\boldsymbol{x}, y}\left[(y- f(\boldsymbol{x}))^2\right]. \]

Note that L1- and L2-regularized regression could equivalently be written in constrained form as \[ \mathcal{F}_{L1} = \{ f: f\in \mathcal{F}_{lin} \textrm{ with }\left\Vert\boldsymbol{\beta}\right\Vert_1 \le C \} \quad\quad \mathcal{F}_{L2} = \{ f: f\in \mathcal{F}_{lin} \textrm{ with }\left\Vert\boldsymbol{\beta}\right\Vert_2 \le C \} \] for a given constant \(C > 0\). It is good to keep these examples in mind, but from now on we will keep the notation more general.
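For concreteness, here is a minimal Python sketch of how one might represent \(\mathcal{F}_{lin}\), the norm-constrained classes, and the squared-error loss. The function names and the constraint level C are hypothetical, introduced only for illustration.

```python
import numpy as np

# A member of F_lin is determined entirely by its coefficient vector beta.
def f_lin(x, beta):
    """Evaluate the linear predictor f(x) = x^T beta."""
    return x @ beta

# Membership checks for the constrained classes F_L1 and F_L2,
# for a given constraint level C (an illustrative constant).
def in_F_L1(beta, C):
    return np.sum(np.abs(beta)) <= C

def in_F_L2(beta, C):
    return np.sqrt(np.sum(beta ** 2)) <= C

# Squared-error loss; the risk R(f) is its expectation over (x, y).
def squared_loss(prediction, y):
    return (y - prediction) ** 2
```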

Suppose we define \[ \hat{f}:= \underset{f\in \mathcal{F}}{\mathrm{argmin}}\, \hat{\mathscr{R}}(f) \quad\quad \overline{f}:= \underset{f\in \mathcal{F}}{\mathrm{argmin}}\, \mathscr{R}(f) \quad\quad f^{\star}:= \underset{f}{\mathrm{argmin}}\, \mathscr{R}(f). \] The analogies for linear regression would be \(\hat{f}(\boldsymbol{x}) = \hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}\), \(\overline{f}(\boldsymbol{x}) = {\boldsymbol{\beta}^{*}}^\intercal\boldsymbol{x}\), and \(f^{\star}(\boldsymbol{x}) = \mathbb{E}_{\vphantom{}}\left[y\vert \boldsymbol{x}\right]\).
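As a hedged illustration of these three minimizers, the following simulation sketch uses squared-error loss: \(f^{\star}\) is the (assumed, nonlinear) true conditional mean, \(\hat{f}\) is fit by least squares on a small training sample, and \(\overline{f}\) is approximated by least squares on a sample so large that sampling error is negligible. The data-generating process and all names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N_train, N_pop = 3, 50, 200_000

def f_star(x):
    # True conditional mean E[y | x]; deliberately nonlinear, so f_star is not in F_lin.
    return np.sin(x[:, 0]) + 0.5 * x[:, 1] ** 2

def draw_data(n):
    x = rng.normal(size=(n, P))
    y = f_star(x) + rng.normal(size=n)
    return x, y

# beta_hat: minimizes the empirical risk (mean squared error) on the training sample.
x_train, y_train = draw_data(N_train)
beta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)

# beta_bar: approximates the population risk minimizer over F_lin by using a sample
# so large that its empirical risk is essentially the true risk.
x_pop, y_pop = draw_data(N_pop)
beta_bar, *_ = np.linalg.lstsq(x_pop, y_pop, rcond=None)

f_hat = lambda x: x @ beta_hat   # plays the role of hat{f}
f_bar = lambda x: x @ beta_bar   # plays the role of bar{f}; f_star above is the overall minimizer
```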

We are interested in the regret, \(\mathscr{R}(\hat{f}) - \mathscr{R}(f^{\star})\). Recall that, by adding and subtracting \(\mathscr{R}(\overline{f})\), we can decompose the regret into two terms:

\[ \mathscr{R}(\hat{f}) - \mathscr{R}(f^{\star}) = \underbrace{\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f})}_{\textrm{Estimation error}} + \underbrace{\mathscr{R}(\overline{f}) - \mathscr{R}(f^{\star})}_{\textrm{Representation error}}. \]

Intuitively, as our model class \(\mathcal{F}\) becomes more complex (e.g., as we reduce the regularization penalty or add more basis functions to the regression space), the representation error decreases, but the estimation error may increase.
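The following hedged sketch illustrates this trade-off numerically, using polynomial bases of increasing degree as nested function classes. The data-generating process, degrees, and sample sizes are all illustrative assumptions, and both \(\mathscr{R}\) and \(\overline{f}\) are approximated using a very large held-out sample.

```python
import numpy as np

rng = np.random.default_rng(1)
f_star = lambda x: np.sin(3 * x)     # assumed true conditional mean
N_train, N_pop = 30, 100_000         # small training set; huge set standing in for the population

def features(x, degree):
    # Polynomial basis expansion: larger degree means a more expressive class.
    return np.vander(x, degree + 1, increasing=True)

def risk(beta, degree, x, y):
    return np.mean((y - features(x, degree) @ beta) ** 2)

x_train = rng.uniform(-1, 1, N_train)
y_train = f_star(x_train) + 0.3 * rng.normal(size=N_train)
x_pop = rng.uniform(-1, 1, N_pop)
y_pop = f_star(x_pop) + 0.3 * rng.normal(size=N_pop)
risk_f_star = np.mean((y_pop - f_star(x_pop)) ** 2)   # approximates R(f_star)

for degree in [1, 3, 5, 10]:
    # hat{f}: empirical risk minimizer on the small training set.
    beta_hat, *_ = np.linalg.lstsq(features(x_train, degree), y_train, rcond=None)
    # bar{f} (approximately): risk minimizer within the class, fit on the huge sample.
    beta_bar, *_ = np.linalg.lstsq(features(x_pop, degree), y_pop, rcond=None)
    est_err = risk(beta_hat, degree, x_pop, y_pop) - risk(beta_bar, degree, x_pop, y_pop)
    rep_err = risk(beta_bar, degree, x_pop, y_pop) - risk_f_star
    print(f"degree {degree:2d}:  estimation error {est_err:6.3f}   representation error {rep_err:6.3f}")
```

Under these assumptions, the representation error shrinks as the degree grows, while the estimation error tends to grow with the size of the class.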

As an oversimplification, estimation error loosely corresponds to “variance” and representation error to “bias” in correctly specified linear regression, though this analogy is misleading, since estimation error can be caused by both statistical bias and variance.

However, it is clear that if there is a cost to an expressive \(\mathcal{F}\), it must come from the estimation error: if it were not necessary to fit \(f\) from the data, we would simply make \(\mathcal{F}\) as expressive as possible. For this unit, therefore, we will focus on understanding the relationship between \(\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f})\) and the complexity of \(\mathcal{F}\).

Estimation error decomposition

Let \(\hat{\mathscr{R}}(f)\) denote the empirical risk evaluated at \(f\). In practice, the empirical risk may include a regularization term in order to enforce membership in \(\mathcal{F}\), but we will ignore this in the notation. We will treat \(\hat{\mathscr{R}}(\cdot)\) as something that (a) we minimize to find \(\hat{f}\) and (b) obeys the LLN for any fixed \(f\), in the sense that \(\hat{\mathscr{R}}(f) \rightarrow \mathscr{R}(f)\) as \(N \rightarrow \infty\).
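For a fixed \(f\), this LLN statement is easy to check in a small simulation sketch; the coefficients and data-generating process below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
beta_fixed = np.array([1.0, -2.0])   # a fixed f in F_lin, chosen without looking at the data

def empirical_risk(beta, x, y):
    # hat{R}(f): the average loss over the sample.
    return np.mean((y - x @ beta) ** 2)

# For a fixed f, hat{R}(f) is an average of IID terms, so the LLN applies;
# here the true coefficients are [1, -1] with unit noise, so R(f) = 1 + 1 = 2.
for n in [100, 10_000, 1_000_000]:
    x = rng.normal(size=(n, 2))
    y = x @ np.array([1.0, -1.0]) + rng.normal(size=n)
    print(n, empirical_risk(beta_fixed, x, y))
```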

We can further decompose the estimation error:

\[ \begin{aligned} \underbrace{\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f})}_{\textrm{Estimation error}} ={}& \underbrace{\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f})}_{\textrm{"Generalization error"}} + \underbrace{\hat{\mathscr{R}}(\hat{f}) - \hat{\mathscr{R}}(\overline{f})}_{\textrm{"Negative term" $\le 0$}} + \underbrace{\hat{\mathscr{R}}(\overline{f}) - \mathscr{R}(\overline{f})}_{\textrm{LLN term}} \end{aligned} \]

This decomposition is useful because

  • \(\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f}) \ge 0\) because \(\overline{f}\) minimizes \(\mathscr{R}(\cdot)\) over \(\mathcal{F}\) and \(\hat{f} \in \mathcal{F}\)
  • \(\hat{\mathscr{R}}(\hat{f}) - \hat{\mathscr{R}}(\overline{f}) \le 0\) because \(\hat{f}\) minimizes \(\hat{\mathscr{R}}(\cdot)\)
  • \(\hat{\mathscr{R}}(\overline{f}) - \mathscr{R}(\overline{f}) \rightarrow 0\) by the LLN because \(\overline{f}\) is fixed.

The key term that we don’t know anything about is \(\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f})\), the generalization error. The generalization error measures how well the empirical risk approximates the true risk when evaluated at \(\hat{f}\). Intuitively, we might expect the generalization error to be positive: since we choose \(\hat{f}\) to minimize \(\hat{\mathscr{R}}(\cdot)\), \(\hat{\mathscr{R}}(\hat{f})\) might be an “optimistic” estimate of \(\mathscr{R}(\hat{f})\). In particular, we cannot apply the LLN to \(\hat{\mathscr{R}}(\hat{f})\) because \[ \hat{\mathscr{R}}(\hat{f}) = \frac{1}{N} \sum_{n=1}^N\underbrace{\mathscr{L}(\hat{f}(\boldsymbol{x}_n), y_n)}_{\textrm{Not IID!}}, \] and the summands are not IID, since \(\hat{f}\) depends on all of the training data.
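To make the three terms concrete, here is a hedged simulation sketch that computes each of them in a well-specified linear model, approximating the true risk \(\mathscr{R}\) with a very large held-out sample. All names and settings are illustrative assumptions; the point is that the generalization error is typically positive, the “negative term” is non-positive by construction, and the LLN term is small.

```python
import numpy as np

rng = np.random.default_rng(3)
P, N_train, N_pop = 5, 40, 500_000
beta_true = rng.normal(size=P)

def draw(n):
    x = rng.normal(size=(n, P))
    y = x @ beta_true + rng.normal(size=n)
    return x, y

def emp_risk(beta, x, y):
    return np.mean((y - x @ beta) ** 2)

x_train, y_train = draw(N_train)   # defines hat{R} and is used to fit hat{f}
x_pop, y_pop = draw(N_pop)         # very large sample standing in for the true risk R

beta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)   # hat{f}
beta_bar = beta_true   # here bar{f} = f_star because the linear model is well specified

generalization = emp_risk(beta_hat, x_pop, y_pop) - emp_risk(beta_hat, x_train, y_train)
negative_term = emp_risk(beta_hat, x_train, y_train) - emp_risk(beta_bar, x_train, y_train)
lln_term = emp_risk(beta_bar, x_train, y_train) - emp_risk(beta_bar, x_pop, y_pop)

print("generalization error:", generalization)   # typically > 0: the training risk is optimistic
print("negative term:       ", negative_term)    # <= 0 because beta_hat minimizes the training risk
print("LLN term:            ", lln_term)         # small, and -> 0 as N_train grows
print("estimation error:    ", generalization + negative_term + lln_term)
```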

By itself, this decomposition does not tell us anything, because the generalization error might be large and positive but canceled by a large “negative term.” However, if we are able to show that the generalization error goes to zero, then, dropping the non-positive “negative term” and bounding the other two terms by their absolute values,

\[ \begin{aligned} &\textrm{If generalization error $\rightarrow 0$, then }\\ &0 \le \mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f}) \le \left|\mathscr{R}(\hat{f}) - \hat{\mathscr{R}}(\hat{f})\right| + \left|\hat{\mathscr{R}}(\overline{f}) - \mathscr{R}(\overline{f})\right| \rightarrow 0 \quad \Rightarrow\\ &\mathscr{R}(\hat{f}) - \mathscr{R}(\overline{f}) \rightarrow 0. \end{aligned} \]

That is, low generalization error is sufficient but not necessary for low estimation error.

This is convenient, because we know nothing about \(\overline{f}\) and so cannot evaluate it in practice. However, we do know \(\hat{f}\) and \(\hat{\mathscr{R}}\), and, as we will see, we can either estimate \(\mathscr{R}\) using out-of-sample data or study it using statistical properties of \(\hat{f}\) and \(\hat{\mathscr{R}}\). This unit, then, is devoted to studying the generalization error, from both a practical and a theoretical point of view.