Lab 1 Exercise - Linear Regression Basics

Author

Erez Buchweitz

Please complete the following exercises at your own pace.

This lab will be graded based on sincere effort; completing all of the exercises below is not necessary in order to receive a full grade. At the end of this lab session, please submit a single file to Gradescope containing your code and your answers to the questions below.

Collaboration in groups of up to 4 students is permitted.

1 Implement ordinary least squares

The ordinary least squares (OLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) is a feature matrix of full column rank and \(\boldsymbol{Y}\in\mathbb{R}^N\) is the target (column) vector. The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).
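The following is a minimal sketch of how this formula can be computed with numpy; the random data, seed and dimensions are purely illustrative, and np.linalg.solve is used rather than forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 10
X = rng.standard_normal((N, P))    # random feature matrix, full column rank with probability 1
Y = rng.standard_normal(N)         # random target vector

# solve (X^T X) beta = X^T Y rather than inverting X^T X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

x_new = rng.standard_normal(P)
y_hat_new = x_new @ beta_hat       # prediction for a new feature vector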

  1. Generate a matrix \(\boldsymbol{X}\) and an array \(\boldsymbol{Y}\) filled with random values using the Python package numpy.
  2. Implement the calculation of the OLS estimator using numpy.
  3. What is the time complexity of your implementation?
  4. Is there a difference between implementing it as \(((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T)\boldsymbol{Y}\) versus \((\boldsymbol{X}^T\boldsymbol{X})^{-1}(\boldsymbol{X}^T\boldsymbol{Y})\)? Why? Compare the execution time of the two variants using the package timeit, for example:
import timeit
timeit.timeit("my_function(x)", number=10000, globals=globals())
  5. Calculate the OLS estimator again over the same data, this time using the Python package sklearn.
  6. Compare the predictions generated by your implementation with those of sklearn; they should agree up to floating-point precision (see the sketch after this list).
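A possible sketch of items 4-6, assuming the illustrative data below; fit_intercept=False makes sklearn solve the same problem as the plain OLS formula above, which has no intercept column:

import numpy as np
import timeit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, P = 2000, 50
X = rng.standard_normal((N, P))
Y = rng.standard_normal(N)

def ols_left_first(X, Y):
    # ((X^T X)^{-1} X^T) Y: builds a P x N matrix before multiplying by Y
    return (np.linalg.inv(X.T @ X) @ X.T) @ Y

def ols_right_first(X, Y):
    # (X^T X)^{-1} (X^T Y): collapses to P-dimensional vectors as early as possible
    return np.linalg.inv(X.T @ X) @ (X.T @ Y)

print(timeit.timeit("ols_left_first(X, Y)", number=100, globals=globals()))
print(timeit.timeit("ols_right_first(X, Y)", number=100, globals=globals()))

# fit_intercept=False so sklearn fits the same model as the formula above
model = LinearRegression(fit_intercept=False).fit(X, Y)
print(np.allclose(model.predict(X), X @ ols_right_first(X, Y)))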
Moral

Choosing to use a library function does not absolve you of having to understand what it is doing.

2 Implement a loss function

The squared error loss function is given by: \[ \mathscr{L}(y, \hat y) = (y- \hat y)^2. \]

  1. Implement a function that computes the average squared error between two vectors \(\boldsymbol{Y}\) and \(\hat{\boldsymbol{Y}}\), element by element:
    • In a vectorized way using numpy.
    • Using a for loop, a list comprehension, or another non-vectorized approach.
  2. Compare the execution time of the two implementations using timeit (see the sketch after this list).
  3. Why is there a difference in execution time between the two implementations?
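A minimal sketch of the two implementations and a timing comparison; the data size and repetition count are illustrative:

import numpy as np
import timeit

rng = np.random.default_rng(0)
Y = rng.standard_normal(100_000)
Y_hat = rng.standard_normal(100_000)

def mse_vectorized(y, y_hat):
    # element-wise squared differences, averaged with numpy
    return np.mean((y - y_hat) ** 2)

def mse_loop(y, y_hat):
    # the same computation in pure Python, one element at a time
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

print(timeit.timeit("mse_vectorized(Y, Y_hat)", number=10, globals=globals()))
print(timeit.timeit("mse_loop(Y, Y_hat)", number=10, globals=globals()))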
Moral

Use vectorized operations whenever possible.

3 Train and test data

  1. Split the rows of \((\boldsymbol{X},\boldsymbol{Y})\) into two sets, call them train and test.
  2. Fit OLS on the train set, and compute the average squared error of the OLS prediction (see the sketch after this list):
    • On the train set.
    • On the test set.
  3. Is there a difference between the two errors? If so, why?
  4. How does the size of the train set affect the two errors?
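One possible sketch, assuming an illustrative 80/20 split of synthetic data; sklearn.model_selection.train_test_split could be used instead of splitting by hand:

import numpy as np

rng = np.random.default_rng(0)
N, P = 500, 20
X = rng.standard_normal((N, P))
beta_true = rng.standard_normal(P)
Y = X @ beta_true + rng.standard_normal(N)   # linear signal plus noise

# shuffle the row indices and split 80/20 into train and test
idx = rng.permutation(N)
n_train = int(0.8 * N)
train, test = idx[:n_train], idx[n_train:]

beta_hat = np.linalg.solve(X[train].T @ X[train], X[train].T @ Y[train])

train_err = np.mean((Y[train] - X[train] @ beta_hat) ** 2)
test_err = np.mean((Y[test] - X[test] @ beta_hat) ** 2)
print(train_err, test_err)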
Moral

Train error is systematically optimistic compared to test error, which is why we must evaluate our predictions on an independent test set.

4 Feature engineering

The OLS prediction \(\hat y=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\) is linear in the feature vector \(\boldsymbol{x}\in\mathbb{R}^P\). However, we sometimes want to learn a model that is not linear in the features, say: \[ \hat y= \varphi(\boldsymbol{x})^T\hat{\boldsymbol{\beta}} \] where \(\varphi:\mathbb{R}^P\to\mathbb{R}^Q\) is some feature mapping that transforms the features and possibly adds new ones. This can be achieved by transforming our feature matrix: we replace \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) with a transformed matrix \(\tilde{\boldsymbol{X}}\in\mathbb{R}^{N\times Q}\) by applying \(\varphi\) to each row, and then compute the OLS estimator on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
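As an illustration of the general recipe, here is one possible \(\varphi\) that appends a constant column and element-wise squares; the particular transformation is only an example:

import numpy as np

def phi(X):
    # maps each row x in R^P to (1, x, x^2) in R^{2P+1}, vectorized over all rows at once
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
Y = rng.standard_normal(200)

X_tilde = phi(X)                  # N x Q with Q = 2P + 1
beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)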

For each of the following examples:

  • Write down the formula for \(\varphi\).
  • Implement \(\varphi\) in Python and compute \(\tilde{\boldsymbol{X}}\) (remember, vectorized operations are faster).
  • Fit OLS on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).

The examples are:

  1. Linear with intercept: \(\hat y= \beta_0 + \boldsymbol{x}^T\boldsymbol{\beta}\).
  2. Quadratic: \(\hat y= \boldsymbol{x}^T \boldsymbol{B}\boldsymbol{x}\) where \(\boldsymbol{B}\) is a \(P\times P\) matrix.
  3. Step function with cutoffs at known values, over a single feature.
  4. Continuous piece-wise linear function, over a single feature (a sketch of one possible basis for this and the previous example follows this list).
  5. Is it possible to implement an infinite-dimensional feature mapping \(\varphi\)? That is, \(Q=\infty\).
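For examples 3 and 4, one possible encoding over a single feature uses indicator and hinge basis functions; the cutoff values and synthetic data below are purely illustrative:

import numpy as np

cutoffs = np.array([-1.0, 0.0, 1.0])   # illustrative known cutoff values

def phi_step(x):
    # step-function basis: intercept plus indicators of exceeding each cutoff
    x = x.reshape(-1, 1)
    return np.hstack([np.ones_like(x), (x > cutoffs).astype(float)])

def phi_piecewise_linear(x):
    # continuous piece-wise linear basis: intercept, x, and hinge terms max(x - c, 0)
    x = x.reshape(-1, 1)
    return np.hstack([np.ones_like(x), x, np.maximum(x - cutoffs, 0.0)])

rng = np.random.default_rng(0)
x = rng.standard_normal(300)            # a single feature
Y = np.sin(x) + 0.1 * rng.standard_normal(300)

for phi in (phi_step, phi_piecewise_linear):
    X_tilde = phi(x)
    beta_hat = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
    print(phi.__name__, np.mean((Y - X_tilde @ beta_hat) ** 2))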
Moral

Linear regression is flexible and expressive, but you must engineer the features yourself.