Lab 1 Exercise - Coding for ML Basics
Please complete the following exercises at your own pace.
This lab will be graded based on sincere effort; completing all the exercises below is not necessary to receive full credit. Please submit a single file to Gradescope containing your code and your answers to the questions below within 24 hours after the end of the lab.
Collaboration in groups of up to 4 students is permitted.
As you do this exercise, be mindful of how you structure your code and of the considerations we raised earlier.
1 Ordinary least squares
1.0.1 Initial implementation
The ordinary least squares (OLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) is a feature matrix of full column rank and \(\boldsymbol{Y}\in\mathbb{R}^N\) is the target (column) vector. The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).
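In `numpy`, this computation could look roughly as follows (a minimal sketch, not the required solution; the sizes and the use of `np.linalg.solve` rather than an explicit inverse are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, for reproducibility
N, P = 200, 5                    # illustrative sizes
X = rng.standard_normal((N, P))  # feature matrix
Y = rng.standard_normal(N)       # target vector

# beta_hat solves the normal equations (X^T X) beta = X^T Y;
# solving is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```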
- Generate a matrix \(\boldsymbol{X}\) and an array \(\boldsymbol{Y}\) filled with random values using the Python package `numpy`.
- Implement the calculation of the OLS estimator using `numpy`.
- What is the time complexity of your implementation?
- Is there a difference between implementing it like \(((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T)\boldsymbol{Y}\) and like \((\boldsymbol{X}^T\boldsymbol{X})^{-1}(\boldsymbol{X}^T\boldsymbol{Y})\)? Why? Compare the execution time of the two variants using the package `timeit`, for example:

  ```python
  import timeit
  timeit.timeit("my_function(x)", number=10000, globals=globals())
  ```

- Calculate the OLS estimator again over the same data, this time using the Python package `sklearn`.
- Compare the predictions generated by your implementation with those of `sklearn`; they should be identical (a sketch of this comparison follows the list).
- Check whether the coefficients computed by `sklearn` match your own.
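For the `sklearn` comparison, a check along these lines may help (a sketch assuming the `X`, `Y`, and `beta_hat` from the previous snippet; `fit_intercept=False` matches the formula above, which has no intercept column):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# fit_intercept=False because the OLS formula above has no intercept term
model = LinearRegression(fit_intercept=False).fit(X, Y)

# Coefficients and predictions should agree up to floating-point error
print(np.allclose(model.coef_, beta_hat))
print(np.allclose(model.predict(X), X @ beta_hat))
```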
Choosing to use a library function does not absolve you of having to understand what it is doing.
1.0.2 Generalized least squares
The generalized least squares (GLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T \boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{W}\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) and \(\boldsymbol{Y}\in\mathbb{R}^N\) are as before and \(\boldsymbol{W}\in\mathbb{R}^{N \times N}\) is a positive semi-definite matrix (i.e. symmetric with non-negative eigenvalues). The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is the same as before: \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).
- Generate a random positive semi-definite matrix \(\boldsymbol{W}\) in whatever way you prefer. It must be different from the identity, but that is the only restriction.
- Implement the calculation of the GLS estimator using `numpy` from scratch.
- Implement a version of your previous OLS estimator which calls the GLS calculation you built in the previous item (note that the OLS should not take a weight matrix \(\boldsymbol{W}\) as input).
- Do the reverse: implement the calculation of the GLS estimator by calling the OLS you implemented in the previous exercise (a sketch of both reductions follows the list).
- Check that all of these implementations yield the same predictions and coefficients on the previous \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) and your randomly generated \(\boldsymbol{W}\).
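One way to see both reductions (a sketch; the factorization step assumes \(\boldsymbol{W}\) is in fact positive definite, so that a Cholesky factor exists): OLS is GLS with \(\boldsymbol{W}= \boldsymbol{I}\), and GLS is OLS on the whitened pair \((\boldsymbol{L}^T\boldsymbol{X}, \boldsymbol{L}^T\boldsymbol{Y})\) where \(\boldsymbol{W}= \boldsymbol{L}\boldsymbol{L}^T\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5
X = rng.standard_normal((N, P))
Y = rng.standard_normal(N)

# A A^T is always positive semi-definite; adding I keeps it well
# conditioned and strictly positive definite (and not the identity).
A = rng.standard_normal((N, N))
W = A @ A.T + np.eye(N)

def gls(X, Y, W):
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

def ols(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)

def ols_via_gls(X, Y):
    # OLS is the special case W = I
    return gls(X, Y, np.eye(len(Y)))

def gls_via_ols(X, Y, W):
    # Factor W = L L^T (Cholesky, assuming W is positive definite); then
    # (X^T W X)^{-1} X^T W Y equals OLS on the whitened pair (L^T X, L^T Y).
    L = np.linalg.cholesky(W)
    return ols(L.T @ X, L.T @ Y)

print(np.allclose(ols(X, Y), ols_via_gls(X, Y)))
print(np.allclose(gls(X, Y, W), gls_via_ols(X, Y, W)))
```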
1.0.3 Class implementation
Implement GLS as a Python class that loosely mirrors the structure of `sklearn` (a minimal sketch of the interface follows the list):
- It should have a `fit` method which takes a pair \(\boldsymbol{X}\) and \(\boldsymbol{Y}\).
- It should have a `predict` method that takes a matrix with new data \(\boldsymbol{X}_\mathrm{new}\) as input.
- It should have a `coef_` attribute that stores the computed GLS estimator.
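A minimal sketch of the shape such a class could take (the class name and the choice to pass \(\boldsymbol{W}\) to the constructor are our own assumptions, not requirements):

```python
import numpy as np

class GLS:
    """Generalized least squares, loosely mirroring the sklearn interface."""

    def __init__(self, W):
        self.W = W  # design choice (an assumption): W is fixed at construction

    def fit(self, X, Y):
        # coef_ = (X^T W X)^{-1} X^T W Y, computed via a linear solve
        self.coef_ = np.linalg.solve(X.T @ self.W @ X, X.T @ self.W @ Y)
        return self  # returning self enables chained calls, as in sklearn

    def predict(self, X_new):
        return X_new @ self.coef_
```

Usage would then look like `GLS(W).fit(X, Y).predict(X_new)`.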
2 Implement a loss function
The squared error loss function is given by: \[ \mathscr{L}(y, \hat y) = (y- \hat y)^2. \]
- Implement a function that computes the average squared error between two vectors \(\boldsymbol{Y}\) and \(\hat{\boldsymbol{Y}}\), element by element:
  - In a vectorized way using `numpy`.
  - Using a for loop, a list comprehension, or another non-vectorized approach.
- Compare the execution time of the two implementations using `timeit` (a sketch follows the list).
- Why is there a difference in execution time between the two implementations?
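A sketch of the two variants and the timing comparison (the array size and the `number` argument are arbitrary choices):

```python
import numpy as np
import timeit

def mse_vectorized(Y, Y_hat):
    # One call into numpy's compiled loops
    return np.mean((Y - Y_hat) ** 2)

def mse_loop(Y, Y_hat):
    # One interpreted Python step per element
    return sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat)) / len(Y)

rng = np.random.default_rng(0)
Y = rng.standard_normal(100_000)
Y_hat = rng.standard_normal(100_000)

print(timeit.timeit("mse_vectorized(Y, Y_hat)", number=100, globals=globals()))
print(timeit.timeit("mse_loop(Y, Y_hat)", number=100, globals=globals()))
```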
Use vectorized operations whenever possible.
3 Feature engineering
The OLS prediction \(\hat y=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\) is linear in the feature vector \(\boldsymbol{x}\in\mathbb{R}^P\). However, we sometimes want to learn a model that is not linear in the features, say: \[ \hat y= \varphi(\boldsymbol{x})^T\hat{\boldsymbol{\beta}} \] where \(\varphi:\mathbb{R}^P\to\mathbb{R}^Q\) is some feature mapping that transforms and possibly adds more features. This can be achieved by transforming our feature matrix \(\boldsymbol{X}\). That is, we replace the feature matrix \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) with a transformed matrix \(\tilde{\boldsymbol{X}}\in\mathbb{R}^{N\times Q}\) by applying \(\varphi\) to each row, then computing the OLS estimator on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
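For instance, applying \(\varphi\) row by row could look like this (a sketch; `transform` and `phi_intercept` are illustrative names):

```python
import numpy as np

def transform(X, phi):
    # Apply the feature map phi to each row of X to obtain X_tilde
    return np.array([phi(x) for x in X])

def phi_intercept(x):
    # phi(x) = (1, x): prepend a constant feature for the intercept
    return np.concatenate(([1.0], x))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X_tilde = transform(X, phi_intercept)  # shape (200, 6)
```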
For each of the following examples:
- Implement \(\varphi\) in Python and compute \(\tilde{\boldsymbol{X}}\).
- Fit OLS on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
- Check the average squared error in each case.
The examples are:
- Linear with intercept: \(\hat y= \beta_0 + \boldsymbol{x}^T\boldsymbol{\beta}\).
- Quadratic: \(\hat y= \boldsymbol{x}^T \boldsymbol{B}\boldsymbol{x}\) where \(\boldsymbol{B}\) is a \(P\times P\) matrix.
- Step function with cutoffs at known values, over a single feature (a sketch of one possible step basis follows this list).
- Piecewise linear function with cutoffs at known values, over a single feature.
- Piecewise quadratic function with cutoffs at known values, over a single feature.
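For the step-function case, one possible \(\varphi\) uses an indicator feature per bin between the cutoffs (a sketch; the cutoff values and the choice to bin the first feature are illustrative):

```python
import numpy as np

def phi_step(x, cutoffs=(-1.0, 0.0, 1.0)):
    # One indicator feature per bin delimited by the cutoffs; OLS on these
    # features fits a constant prediction within each bin.
    edges = np.asarray(cutoffs)
    features = np.zeros(len(edges) + 1)
    features[np.digitize(x[0], edges)] = 1.0  # mark the bin containing x[0]
    return features
```

One way to extend this to the piecewise linear and quadratic variants is to add, alongside each indicator, the indicator multiplied by \(x\) (and by \(x^2\)), giving an independent line (or parabola) per bin.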
Linear regression is flexible and expressive, but you must engineer the features yourself.