Lab 1 Exercise - Linear Regression Basics
Please complete the following exercises at your own pace.
This lab will be graded based on sincere effort; completing all of the exercises below is not necessary to receive a full grade. At the end of this lab session, please submit a single file to Gradescope containing your code and your answers to the questions below.
Collaboration in groups of up to 4 students is permitted.
1 Implement ordinary least squares
The ordinary least squares (OLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) is a feature matrix of full column rank and \(\boldsymbol{Y}\in\mathbb{R}^N\) is the target (column) vector. The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).
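To make the formula concrete, here is a minimal `numpy` sketch of the estimator; the sizes, random data, and variable names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                        # illustrative sizes
X = rng.normal(size=(N, P))          # feature matrix (full column rank almost surely)
Y = rng.normal(size=N)               # target vector

# OLS estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# prediction for a new feature vector x_new
x_new = rng.normal(size=P)
y_hat_new = x_new @ beta_hat
```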
- Generate a matrix \(\boldsymbol{X}\) and an array \(\boldsymbol{Y}\) filled with random values using the Python package `numpy`.
- Implement the calculation of the OLS estimator using `numpy`.
- What is the time complexity of your implementation?
- Is there a difference between implementing it like \(((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T)\boldsymbol{Y}\) and like \((\boldsymbol{X}^T\boldsymbol{X})^{-1}(\boldsymbol{X}^T\boldsymbol{Y})\)? Why? Compare the execution time of the two variants using the package `timeit` (a timing and comparison sketch appears after this list), for example:

  ```python
  import timeit
  timeit.timeit("my_function(x)", number=10000, globals=globals())
  ```

- Calculate the OLS estimator again over the same data, this time using the Python package `sklearn`.
- Compare the predictions generated by your implementation with those of `sklearn`; they should be identical.
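One possible way to time the two variants and cross-check against `sklearn`, reusing the `X` and `Y` generated above; the function names below are made up for illustration.

```python
import timeit
import numpy as np
from sklearn.linear_model import LinearRegression

def ols_left(X, Y):
    # ((X^T X)^{-1} X^T) Y: forms a P x N matrix before touching Y
    return (np.linalg.inv(X.T @ X) @ X.T) @ Y

def ols_right(X, Y):
    # (X^T X)^{-1} (X^T Y): collapses to P-dimensional vectors early
    return np.linalg.inv(X.T @ X) @ (X.T @ Y)

print(timeit.timeit("ols_left(X, Y)", number=1000, globals=globals()))
print(timeit.timeit("ols_right(X, Y)", number=1000, globals=globals()))

# fit_intercept=False matches the plain formula, which has no intercept term
model = LinearRegression(fit_intercept=False).fit(X, Y)
print(np.allclose(model.coef_, ols_right(X, Y)))   # the coefficients should agree
```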
Choosing to use a library function does not absolve you of having to understand what it is doing.
2 Implement a loss function
The squared error loss function is given by: \[ \mathscr{L}(y, \hat y) = (y- \hat y)^2. \]
- Implement a function that computes the average squared error between two vectors \(\boldsymbol{Y}\) and \(\hat{\boldsymbol{Y}}\), element by element:
  - In a vectorized way using `numpy`.
  - Using a for loop, a list comprehension, or another non-vectorized approach.
- Compare the execution time of the two implementations using `timeit` (see the sketch after this list).
- Why is there a difference in execution time between the two implementations?
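A sketch of the two implementations and the timing comparison; the array sizes are arbitrary and the function names are illustrative.

```python
import timeit
import numpy as np

def mse_vectorized(Y, Y_hat):
    # all N squared differences computed in a single numpy expression
    return np.mean((Y - Y_hat) ** 2)

def mse_loop(Y, Y_hat):
    # the same computation with an explicit Python-level loop
    total = 0.0
    for y, y_hat in zip(Y, Y_hat):
        total += (y - y_hat) ** 2
    return total / len(Y)

rng = np.random.default_rng(0)
Y = rng.normal(size=100_000)
Y_hat = rng.normal(size=100_000)

print(timeit.timeit("mse_vectorized(Y, Y_hat)", number=100, globals=globals()))
print(timeit.timeit("mse_loop(Y, Y_hat)", number=100, globals=globals()))
```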
Always use vectorized operations whenever possible.
3 Train and test data
- Split the rows of \((\boldsymbol{X},\boldsymbol{Y})\) into two sets, call them train and test.
- Fit OLS on the train set, and compute the average squared error of the OLS prediction (a sketch appears after this list):
- On the train set.
- On the test set.
- Is there a difference between the two errors? If so, why?
- How does the size of the train set affect the two errors?
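A possible sketch of the split and the two errors using `numpy` only; the 80/20 split ratio and the data sizes are arbitrary, and you could equally use `sklearn.model_selection.train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

# shuffle the row indices and split, e.g. 80% train / 20% test
perm = rng.permutation(N)
n_train = int(0.8 * N)
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, Y_train = X[train_idx], Y[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]

# fit OLS on the train set only
beta_hat = np.linalg.inv(X_train.T @ X_train) @ (X_train.T @ Y_train)

# average squared error on each set
train_error = np.mean((Y_train - X_train @ beta_hat) ** 2)
test_error = np.mean((Y_test - X_test @ beta_hat) ** 2)
print(train_error, test_error)
```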
Training error is typically optimistic compared to test error, which is why we must evaluate our predictions on an independent test set.
4 Feature engineering
The OLS prediction \(\hat y=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\) is linear in the feature vector \(\boldsymbol{x}\in\mathbb{R}^P\). However, we sometimes want to learn a model that is not linear in the features, say: \[ \hat y= \varphi(\boldsymbol{x})^T\hat{\boldsymbol{\beta}} \] where \(\varphi:\mathbb{R}^P\to\mathbb{R}^Q\) is some feature mapping that transforms the existing features and possibly adds new ones. This can be achieved by transforming our feature matrix \(\boldsymbol{X}\): we replace \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) with a transformed matrix \(\tilde{\boldsymbol{X}}\in\mathbb{R}^{N\times Q}\) by applying \(\varphi\) to each row, and then compute the OLS estimator on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
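As a rough sketch of this recipe, here is a placeholder mapping that simply appends the squares of the features; the actual mappings to use are listed below.

```python
import numpy as np

def phi(x):
    # placeholder mapping R^P -> R^Q; replace with one of the mappings below
    return np.concatenate([x, x ** 2])

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))
Y = rng.normal(size=N)

# apply phi row by row to obtain the transformed feature matrix (N x Q);
# for a specific phi you would usually build X_tilde with vectorized operations instead
X_tilde = np.apply_along_axis(phi, 1, X)

# then fit OLS on (X_tilde, Y) exactly as before
beta_hat = np.linalg.inv(X_tilde.T @ X_tilde) @ (X_tilde.T @ Y)
```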
For each of the following examples:
- Write down the formula for \(\varphi\).
- Implement \(\varphi\) in Python and compute \(\tilde{\boldsymbol{X}}\) (remember, vectorized operations are faster).
- Fit OLS on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
The examples are as follows (a partial sketch of the constructions appears after this list):
- Linear with intercept: \(\hat y= \beta_0 + \boldsymbol{x}^T\boldsymbol{\beta}\).
- Quadratic: \(\hat y= \boldsymbol{x}^T \boldsymbol{B}\boldsymbol{x}\) where \(\boldsymbol{B}\) is a \(P\times P\) matrix.
- Step function with cutoffs at known values, over a single feature.
- Continuous piece-wise linear function, over a single feature.
- Is it possible to implement an infinite-dimensional feature mapping \(\varphi\)? That is, \(Q=\infty\).
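One possible vectorized construction of \(\tilde{\boldsymbol{X}}\) for the first three examples; the cutoff values below are illustrative, and writing down \(\varphi\) itself, the piece-wise linear case, and the last question are left to you.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
X = rng.normal(size=(N, P))

# Linear with intercept: phi(x) = (1, x_1, ..., x_P), so Q = P + 1
X_intercept = np.hstack([np.ones((N, 1)), X])

# Quadratic: x^T B x = vec(x x^T)^T vec(B), so phi(x) = vec(x x^T) and Q = P^2;
# the fitted beta_hat then plays the role of vec(B)
X_quadratic = np.einsum("ni,nj->nij", X, X).reshape(N, P * P)

# Step function over a single feature (here the first) with known cutoffs:
# phi(x) is the one-hot indicator of the interval that x_1 falls into
cutoffs = np.array([-1.0, 0.0, 1.0])          # illustrative cutoff values
bins = np.searchsorted(cutoffs, X[:, 0])      # interval index in 0 .. len(cutoffs)
X_step = np.eye(len(cutoffs) + 1)[bins]       # one-hot rows, Q = len(cutoffs) + 1
```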
Linear regression is flexible and expressive, but you must engineer the features yourself.