Linear regression in matrix notation (review)

Author

Ryan Giordano

Goals

  • Express ordinary least squares regression in matrix notation

Reading

These lecture notes are a supplement for

  • Hastie, Tibshirani, and Friedman (2009) Section 3.2
  • Ding (2024) Chapter 3

Matrix notation for linear regression (review)

Here is a formal definition of a linear regression model:

\[ \begin{align*} y_n ={}& \beta_1 x_{n1} + \beta_2 x_{n2} + \ldots + \beta_P x_{nP} + \varepsilon_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \tag{1}\]

Notation

I will always use \(N\) for the number of observed data points, and \(P\) for the dimension of the regression vector.

For example, if we take \(P = 2\), \(x_{n1} \equiv 1\), and \(x_{n2} = x_n\) for some scalar \(x_n\), then Equation 1 becomes simple linear regression:

\[ \begin{align*} y_n ={}& \beta_1 + \beta_2 x_{n} + \varepsilon_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \]

The residuals \(\varepsilon_n\) measure the “misfit” of the line. If you know \(\beta_1, \ldots, \beta_P\), then you can compute

\[ \begin{align*} \varepsilon_n ={}& y_n - (\beta_1 x_{n1} + \beta_2 x_{n2} + \ldots + \beta_P x_{nP}). \end{align*} \]

But in general we only observe \(y_n\) and \(x_{n1}, \ldots, x_{nP}\), and we choose \(\beta_1, \ldots, \beta_P\) to make the residuals small. (How we do this precisely will be something we talk about at great length.)
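To make this concrete, here is a minimal numerical sketch in Python with numpy, using made-up data and a made-up candidate coefficient vector, that computes the residuals for given coefficients \(\beta_1, \ldots, \beta_P\):

```python
import numpy as np

# Hypothetical data: N = 5 observations and P = 2 regressors,
# where the first regressor is a constant.
x = np.array([[1.0, 0.5],
              [1.0, 1.3],
              [1.0, 2.1],
              [1.0, 2.9],
              [1.0, 4.0]])
y = np.array([1.1, 2.8, 4.2, 6.1, 8.0])

# A candidate coefficient vector (not necessarily a good one).
beta = np.array([0.0, 2.0])

# Residuals: eps_n = y_n - (beta_1 x_{n1} + ... + beta_P x_{nP}).
eps = y - x @ beta
print(eps)
```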

The general form of Equation 1 can be written more compactly using matrix and vector notation. Specifically, if we let

\[ \begin{align*} \boldsymbol{x}_n := \begin{pmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nP} \end{pmatrix} \quad \textrm{and} \quad \boldsymbol{\beta}:= \begin{pmatrix} \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_{P} \end{pmatrix} \end{align*} \]

Notation

Bold lowercase variables are column vectors (unless otherwise specified).

Recall that the “transpose” operator \((\cdot)^\intercal\) flips the rows and columns of a matrix. For example,

\[ \begin{align*} \boldsymbol{x}_n ^\intercal= \begin{pmatrix} x_{n1} & x_{n2} & \ldots & x_{nP} \end{pmatrix}. \end{align*} \]

By matrix multiplication rules,

\[ \begin{align*} \boldsymbol{x}_n^\intercal\boldsymbol{\beta}= \begin{pmatrix} x_{n1} & x_{n2} & \ldots & x_{nP} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_2 \\ \vdots \\ \beta_{P} \end{pmatrix} = \beta_1 x_{n1} + \beta_2 x_{n2} + \ldots + \beta_P x_{nP}. \end{align*} \]

Notation

I have written \(\boldsymbol{x}_n^\intercal\boldsymbol{\beta}\) for the “dot product” or “inner product” between \(\boldsymbol{x}_n\) and \(\boldsymbol{\beta}\). Writing it in this way clarifies the relationship with matrix notation below.

There are many other ways to denote inner products in the literature, including \(\boldsymbol{x}_n \cdot \boldsymbol{\beta}\) and \(\langle \boldsymbol{x}_n, \boldsymbol{\beta}\rangle\).
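As a quick numerical illustration of the matrix-multiplication rule above, the following sketch (Python with numpy; the numbers are arbitrary) computes \(\boldsymbol{x}_n^\intercal\boldsymbol{\beta}\) both as a matrix product and as an explicit sum:

```python
import numpy as np

# Arbitrary small example with P = 3.
x_n = np.array([1.0, 2.0, 3.0])
beta = np.array([0.5, -1.0, 2.0])

# The inner product x_n^T beta as a matrix product...
print(x_n @ beta)                                # 4.5
# ...and as the explicit sum beta_1 x_{n1} + ... + beta_P x_{nP}.
print(sum(x_n[p] * beta[p] for p in range(3)))   # 4.5
```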

Then we can compactly write

\[ \begin{align*} y_n ={}& \boldsymbol{x}_n ^\intercal\boldsymbol{\beta}+ \varepsilon_{n}, \quad\textrm{For }n=1,\ldots,N. \end{align*} \]

We can compactify it even further if we stack the \(N\) observations: \[ \begin{align*} y_1 ={}& \boldsymbol{x}_1 ^\intercal\boldsymbol{\beta}+ \varepsilon_{1} \\ y_2 ={}& \boldsymbol{x}_2 ^\intercal\boldsymbol{\beta}+ \varepsilon_{2} \\ \vdots\\ y_N ={}& \boldsymbol{x}_N ^\intercal\boldsymbol{\beta}+ \varepsilon_{N}. \end{align*} \]

Just as we stacked the regressors into \(\boldsymbol{x}_n\), we can stack the responses and residuals:

\[ \begin{align*} \boldsymbol{Y}:= \begin{pmatrix} y_{1} \\ y_2 \\ \vdots \\ y_{N} \end{pmatrix} \quad \textrm{and} \quad \boldsymbol{\varepsilon}:= \begin{pmatrix} \varepsilon_{1} \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{N} \end{pmatrix}. \end{align*} \]

We can also stack the regressors:

\[ \begin{align*} \boldsymbol{X}:= \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1P}\\ x_{21} & x_{22} & \ldots & x_{2P}\\ \vdots\\ x_{n1} & x_{n2} & \ldots & x_{nP}\\ \vdots\\ x_{N1} & x_{N2} & \ldots & x_{NP} \end{pmatrix} = \begin{pmatrix} \boldsymbol{x}_{1}^\intercal\\ \boldsymbol{x}_{2}^\intercal\\ \vdots \\ \boldsymbol{x}_n^\intercal\\ \vdots \\ \boldsymbol{x}_{N}^\intercal \end{pmatrix} \end{align*} \]

Notation

I will use upper case bold letters for matrices like \(\boldsymbol{X}\). I may also use upper case bold letters for quantities that could be written as column vectors, when I think it is more useful to view the quantity as a matrix with a single column. Examples are \(\boldsymbol{Y}\) above, or \(\boldsymbol{X}\) when \(P = 1\).

Note that by matrix multiplication rules,

\[ \begin{align*} \boldsymbol{X}\boldsymbol{\beta}= \begin{pmatrix} \boldsymbol{x}_{1}^\intercal\\ \boldsymbol{x}_{2}^\intercal\\ \vdots \\ \boldsymbol{x}_n^\intercal\\ \vdots \\ \boldsymbol{x}_{N}^\intercal \end{pmatrix} \boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{x}_{1}^\intercal\boldsymbol{\beta}\\ \boldsymbol{x}_{2}^\intercal\boldsymbol{\beta}\\ \vdots \\ \boldsymbol{x}_n^\intercal\boldsymbol{\beta}\\ \vdots \\ \boldsymbol{x}_{N}^\intercal\boldsymbol{\beta} \end{pmatrix}, \end{align*} \]

so we end up with the extremely tidy expression

\[ \begin{align*} y_n ={}& \beta_1 x_{n1} + \beta_2 x_{n2} + \ldots + \beta_P x_{nP} + \varepsilon_{n}, \quad\textrm{For }n=1,\ldots,N \\ &\textrm{is the same as} \\ \boldsymbol{Y}={}& \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}. \end{align*} \tag{2}\]
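As a sanity check on Equation 2, here is a small simulation sketch (Python with numpy; the dimensions, coefficients, and noise scale are all arbitrary) confirming that the stacked form reproduces the row-by-row form:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 3

X = rng.normal(size=(N, P))          # rows of X are the x_n^T
beta = np.array([1.0, -2.0, 0.5])    # arbitrary coefficient vector
eps = 0.1 * rng.normal(size=N)       # arbitrary residuals

# The stacked form of Equation 2.
Y = X @ beta + eps

# Row by row, y_n = x_n^T beta + eps_n gives the same numbers.
y_rowwise = np.array([X[n] @ beta + eps[n] for n in range(N)])
print(np.allclose(Y, y_rowwise))     # True
```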

In the case of simple linear regression, we can write

\[ \begin{align*} \boldsymbol{X}:= \begin{pmatrix} 1 & x_{1}\\ 1 & x_{2}\\ \vdots & \vdots\\ 1 & x_{N}\\ \end{pmatrix}, \end{align*} \tag{3}\]

and verify that the \(n\)–th row of Equation 2 is the same as the simple linear regression model written above.
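In code, the design matrix of Equation 3 amounts to prepending a column of ones to the observed scalar regressors. A minimal sketch in Python with numpy, using hypothetical values:

```python
import numpy as np

# Hypothetical scalar regressor values for N = 4 observations.
x = np.array([0.5, 1.3, 2.1, 2.9])
N = len(x)

# The design matrix of Equation 3: a column of ones next to the x values.
X = np.column_stack([np.ones(N), x])
print(X)
# Each row is (1, x_n), so the n-th entry of X beta is beta_1 + beta_2 x_n.
```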

Least squares in matrix notation

Using our tidy expression Equation 2, we can easily write out the sum of the squared errors as

\[ \begin{aligned} \sum_{n=1}^N\varepsilon_n^2 ={}& \boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}= (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}) \\={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}- \boldsymbol{\beta}^\intercal\boldsymbol{X}^\intercal\boldsymbol{Y}- \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\beta}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta} \\={}& \boldsymbol{Y}^\intercal\boldsymbol{Y}- 2 \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\beta}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}. \end{aligned} \]
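If you want to convince yourself of this expansion numerically, a quick sketch (Python with numpy, arbitrary random inputs) checks that the first and last expressions agree:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 6, 3
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)
beta = rng.normal(size=P)

# Direct sum of squared residuals.
eps = Y - X @ beta
sse_direct = eps @ eps

# The expanded quadratic form in beta.
sse_expanded = Y @ Y - 2 * Y @ X @ beta + beta @ X.T @ X @ beta
print(np.isclose(sse_direct, sse_expanded))   # True
```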

The sum of squared errors is a quadratic function of the vector \(\boldsymbol{\beta}\), and we wish to find the value of \(\boldsymbol{\beta}\) that minimizes it. We might hope that the minimum occurs at a point where the gradient of this expression is zero.

Rather than compute the univariate derivative with respect to each component, we can compute the multivariate gradient with respect to the vector.

Let’s recall some facts from vector calculus.

Notation

Take \(\boldsymbol{z}\in \mathbb{R}^P\) to be a \(P\)–vector, and let \(f(\boldsymbol{z})\) be a scalar–valued function of the vector \(\boldsymbol{z}\). We write

\[ \frac{\partial f(\boldsymbol{z})}{\partial \boldsymbol{z}} = \begin{pmatrix} \frac{\partial}{\partial z_1} f(\boldsymbol{z}) \\ \vdots\\ \frac{\partial}{\partial z_P} f(\boldsymbol{z}) \\ \end{pmatrix}. \]

That is, the partial derivative \(\frac{\partial f(\boldsymbol{z})}{\partial \boldsymbol{z}}\) is a \(P\)–vector of the stacked univariate derivatives.

Recall a couple of rules from vector calculus. Let \(\boldsymbol{v}\) denote a \(P\)–vector and \(\boldsymbol{A}\) a symmetric \(P \times P\) matrix. Then

\[ \begin{align*} \frac{\partial \boldsymbol{v}^\intercal\boldsymbol{z}}{\partial \boldsymbol{z}} = \boldsymbol{v} \quad\textrm{and}\quad \frac{\partial \boldsymbol{z}^\intercal\boldsymbol{A}\boldsymbol{z}}{\partial \boldsymbol{z}} = 2 \boldsymbol{A}\boldsymbol{z}. \end{align*} \]

Exercise

Prove these results above using univariate derivatives and our stacking convention.
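The exercise asks for an algebraic proof; as a complement (not a substitute), here is a finite-difference check of both rules in Python with numpy, using arbitrary random \(\boldsymbol{v}\), \(\boldsymbol{A}\), and \(\boldsymbol{z}\):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4
v = rng.normal(size=P)
A = rng.normal(size=(P, P))
A = (A + A.T) / 2                    # symmetrize A
z = rng.normal(size=P)

def num_grad(f, z, h=1e-6):
    """Central finite-difference approximation to the gradient of f at z."""
    grad = np.zeros_like(z)
    for p in range(len(z)):
        e = np.zeros_like(z)
        e[p] = h
        grad[p] = (f(z + e) - f(z - e)) / (2 * h)
    return grad

# d(v'z)/dz = v
print(np.allclose(num_grad(lambda z: v @ z, z), v, atol=1e-6))
# d(z'Az)/dz = 2 A z
print(np.allclose(num_grad(lambda z: z @ A @ z, z), 2 * A @ z, atol=1e-5))
```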

Applying these two rules to our least squares objective,

\[ \begin{aligned} \frac{\partial\boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}}{\partial \boldsymbol{\beta}} ={}& \frac{\partial }{\partial \boldsymbol{\beta}} \boldsymbol{Y}^\intercal\boldsymbol{Y}- 2 \frac{\partial }{\partial \boldsymbol{\beta}} \boldsymbol{Y}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ \frac{\partial }{\partial \boldsymbol{\beta}} \boldsymbol{\beta}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta} \\ ={}& 0 - 2 \boldsymbol{X}^\intercal\boldsymbol{Y}+ 2 \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}. \end{aligned} \]
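The same finite-difference idea can be used to spot-check this gradient formula. A brief sketch (Python with numpy, arbitrary random data), separate from the derivation itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 8, 3
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)
beta = rng.normal(size=P)

# The analytic gradient derived above: -2 X'Y + 2 X'X beta.
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite differences of the sum of squared errors, one coordinate at a time.
def sse(b):
    r = Y - X @ b
    return r @ r

h = 1e-6
num = np.array([(sse(beta + h * np.eye(P)[p]) - sse(beta - h * np.eye(P)[p])) / (2 * h)
                for p in range(P)])
print(np.allclose(grad, num, atol=1e-4))   # True
```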

Assuming our estimator \(\hat{\boldsymbol{\beta}}\) sets these partial derivatives equal to zero, we then get

\[ \begin{align*} \boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}={}& \boldsymbol{X}^\intercal\boldsymbol{Y}. \end{align*} \tag{4}\]

This is a set of \(P\) equations in \(P\) unknowns, often called the “normal equations.” If it is not degenerate, one can solve for \(\hat{\boldsymbol{\beta}}\). That is, if the matrix \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then we can multiply both sides of Equation 4 by \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) to get

\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}. \tag{5}\]
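In code, Equation 5 is usually evaluated by solving the linear system of Equation 4 rather than forming the inverse explicitly, which is cheaper and numerically more stable. A minimal sketch in Python with numpy, on simulated data with an arbitrary true coefficient vector:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
beta_true = np.array([1.0, -2.0, 0.5])          # arbitrary "true" coefficients
Y = X @ beta_true + 0.1 * rng.normal(size=N)

# Solve the normal equations X'X beta = X'Y (Equation 4) directly...
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# ...and compare with numpy's built-in least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))        # True
print(beta_hat)                                  # close to beta_true
```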

Notation

Unlike many other regression texts (and software such as R's lm function), I will not necessarily assume that a constant is included in the regression. One can always make a generic regression \(y_n \sim \boldsymbol{x}_n\) include a constant by taking one of the entries of \(\boldsymbol{x}_n\) to be identically one. At some points, my convention of not including a constant by default will lead to formulas that are at odds with some textbooks. But these differences are superficial and, in my mind, more than made up for by the generality and simplicity of treating the constant as just another regressor.
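To illustrate the point about constants, here is a small sketch (Python with numpy, simulated data with an arbitrary intercept and slope) showing that including a column of ones in \(\boldsymbol{X}\) and applying Equation 5 reproduces a standard intercept-and-slope fit:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 40
x = rng.uniform(0, 5, size=N)
y = 1.5 + 2.0 * x + 0.2 * rng.normal(size=N)      # arbitrary intercept and slope

# Treat the constant as just another regressor: a column of ones in X.
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (intercept, slope)

# Compare with a standard fit that handles the intercept for you.
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose(beta_hat, [intercept, slope]))  # True
```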

References

Ding, P. 2024. “Linear Model and Extensions.” https://arxiv.org/abs/2401.00649.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.