Lab 1 Exercise - Coding for ML Basics

Author

Erez Buchweitz/Lucas Schwengber

Please complete the following exercises at your own pace.

This lab will be graded based on sincere effort; completing all the exercises below is not necessary to receive a full grade. Please submit a single file to Gradescope containing your code and answers to the questions below within 24 hours after the end of the lab.

Collaboration in groups of up to 4 students is permitted.

As you do this exercise, be mindful about how you structure your code and the considerations we raised earlier.

1 Ordinary least squares

1.0.1 Initial implementation

The ordinary least squares (OLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) is a feature matrix of full column rank and \(\boldsymbol{Y}\in\mathbb{R}^N\) is the target (column) vector. The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).

  1. Generate a matrix \(\boldsymbol{X}\) and an array \(\boldsymbol{Y}\) filled with random values using the Python package numpy.
  2. Implement the calculation of the OLS estimator using numpy.
  3. What is the time complexity of your implementation?
  4. Is there a difference between implementing it as \(((\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T)\boldsymbol{Y}\) and as \((\boldsymbol{X}^T\boldsymbol{X})^{-1}(\boldsymbol{X}^T\boldsymbol{Y})\)? Why? Compare the execution times of the two variants using the package timeit, for example:
import timeit
timeit.timeit("my_function(x)", number=10000, globals=globals())
  5. Calculate the OLS estimator again over the same data, this time using the Python package sklearn.
  6. Compare the predictions generated by your implementation with those of sklearn; they should be identical.
  7. Check whether the coefficients computed by sklearn match your own.
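The steps above might be sketched as follows, assuming synthetic Gaussian data and no intercept column (the dimensions N and P and the seed are arbitrary choices, not prescribed by the exercise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
N, P = 200, 5                   # arbitrary dimensions
X = rng.standard_normal((N, P))
Y = rng.standard_normal(N)

# OLS via the normal equations; np.linalg.solve avoids forming an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The two groupings from step 4: the second computes X^T Y (a small P-vector)
# first, so it never materializes the P x N matrix ((X^T X)^{-1} X^T)
beta_v1 = (np.linalg.inv(X.T @ X) @ X.T) @ Y
beta_v2 = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# sklearn comparison; fit_intercept=False because X carries no intercept column
model = LinearRegression(fit_intercept=False).fit(X, Y)
```

Timing the two groupings with timeit as in the snippet above should show the second is cheaper: after forming \(\boldsymbol{X}^T\boldsymbol{X}\), the first grouping costs \(O(P^2 N)\) for the \(P\times N\) product, while the second costs \(O(NP)\) for \(\boldsymbol{X}^T\boldsymbol{Y}\) plus \(O(P^2)\) for the final product.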
Moral

Choosing to use a library function does not absolve you of having to understand what it is doing.

1.0.2 Generalized least squares

The generalized least squares (GLS) estimator is a \(P\)-dimensional (column) vector given by \[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^T \boldsymbol{W}\boldsymbol{X})^{-1}\boldsymbol{X}^T \boldsymbol{W}\boldsymbol{Y} \] where \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) and \(\boldsymbol{Y}\in\mathbb{R}^N\) are as before and \(\boldsymbol{W}\in\mathbb{R}^{N \times N}\) is a positive semi-definite matrix (i.e., symmetric with non-negative eigenvalues). The prediction for a new feature (column) vector \(\boldsymbol{x}_\mathrm{new}\in\mathbb{R}^P\) is the same as before: \(\hat y_\mathrm{new}= \boldsymbol{x}_\mathrm{new}^T\hat{\boldsymbol{\beta}}\).

  1. Generate a random positive semi-definite matrix \(\boldsymbol{W}\) in whatever way you prefer. It must be different from the identity, but that is the only restriction.
  2. Implement the calculation of the GLS estimator from scratch using numpy.
  3. Implement a version of your previous OLS estimator that calls the GLS calculation you built in step 2 (note that the OLS should not take a weight matrix \(\boldsymbol{W}\) as input).
  4. Do the reverse: implement the calculation of the GLS by calling the OLS you implemented in the previous exercise.
  5. Check that all previous implementations yield the same predictions and coefficients on the previous \(\boldsymbol{X}\), \(\boldsymbol{Y}\) and your randomly generated \(\boldsymbol{W}\).

1.0.3 Class implementation

Implement GLS as a Python class that loosely mirrors the structure of sklearn:

  1. It should have a fit method which takes a pair \(\boldsymbol{X}\) and \(\boldsymbol{Y}\).
  2. It should have a predict method that takes a matrix of new data \(\boldsymbol{X}_\mathrm{new}\) as input.
  3. It should have a coef_ attribute that stores the computed GLS estimator.
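A sketch of such a class, assuming the weight matrix is passed to the constructor (that interface choice is an assumption, not prescribed by the exercise):

```python
import numpy as np

class GLS:
    """Minimal GLS estimator loosely mirroring sklearn's interface (a sketch)."""

    def __init__(self, W=None):
        self.W = W  # None falls back to identity weights, i.e. plain OLS

    def fit(self, X, Y):
        W = np.eye(X.shape[0]) if self.W is None else self.W
        # coef_ follows sklearn's trailing-underscore convention for fitted attributes
        self.coef_ = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
        return self  # returning self allows chaining: GLS().fit(X, Y).predict(X_new)

    def predict(self, X_new):
        return X_new @ self.coef_

# example usage on synthetic data
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
Y = rng.standard_normal(50)
model = GLS().fit(X, Y)
preds = model.predict(X)
```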

2 Implement a loss function

The squared error loss function is given by: \[ \mathscr{L}(y, \hat y) = (y- \hat y)^2. \]

  1. Implement a function that computes the average squared error between two vectors \(\boldsymbol{Y}\) and \(\hat{\boldsymbol{Y}}\), element by element:
    • In a vectorized way using numpy.
    • Using a for loop, a list comprehension, or another non-vectorized approach.
  2. Compare the execution time of the two implementations, using timeit.
  3. Why is there a difference in execution time between the two implementations?
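The two implementations and the timing comparison might be sketched as follows (the vector size and repetition count are arbitrary):

```python
import numpy as np
import timeit

def mse_vectorized(Y, Y_hat):
    # the subtraction, squaring, and mean all run as compiled loops inside numpy
    return np.mean((Y - Y_hat) ** 2)

def mse_loop(Y, Y_hat):
    # each element passes through the Python interpreter individually
    return sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat)) / len(Y)

rng = np.random.default_rng(3)
Y = rng.standard_normal(10_000)
Y_hat = rng.standard_normal(10_000)

t_vec = timeit.timeit(lambda: mse_vectorized(Y, Y_hat), number=100)
t_loop = timeit.timeit(lambda: mse_loop(Y, Y_hat), number=100)
```

The loop version pays per-element interpreter and boxing overhead, while the vectorized version dispatches once and iterates in C.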
Moral

Use vectorized operations whenever possible.

3 Feature engineering

The OLS prediction \(\hat y=\boldsymbol{x}^T\hat{\boldsymbol{\beta}}\) is linear in the feature vector \(\boldsymbol{x}\in\mathbb{R}^P\). However, we sometimes want to learn a model that is not linear in the features, say: \[ \hat y= \varphi(\boldsymbol{x})^T\hat{\boldsymbol{\beta}} \] where \(\varphi:\mathbb{R}^P\to\mathbb{R}^Q\) is some feature mapping that transforms and possibly adds more features. This can be achieved by transforming our feature matrix \(\boldsymbol{X}\). That is, we replace the feature matrix \(\boldsymbol{X}\in\mathbb{R}^{N\times P}\) with a transformed matrix \(\tilde{\boldsymbol{X}}\in\mathbb{R}^{N\times Q}\) by applying \(\varphi\) to each row, then computing the OLS estimator on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).

For each of the following examples:

  • Implement \(\varphi\) in Python and compute \(\tilde{\boldsymbol{X}}\).
  • Fit OLS on \((\tilde{\boldsymbol{X}}, \boldsymbol{Y})\).
  • Compute the average squared error in each case.

The examples are:

  1. Linear with intercept: \(\hat y= \beta_0 + \boldsymbol{x}^T\boldsymbol{\beta}\).
  2. Quadratic: \(\hat y= \boldsymbol{x}^T \boldsymbol{B}\boldsymbol{x}\) where \(\boldsymbol{B}\) is a \(P\times P\) matrix.
  3. Step function with cutoffs at known values, over a single feature.
  4. Piecewise linear function with cutoffs at known values, over a single feature.
  5. Piecewise quadratic function with cutoffs at known values, over a single feature.
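As an illustration of the pattern, here is a sketch of examples 1 and 4, assuming a single feature and a single known cutoff at 0 (hinge features \(\max(0, x-c)\) are one standard way to build a piecewise linear basis):

```python
import numpy as np

def phi_intercept(X):
    # example 1: prepend a column of ones so that beta_0 is its coefficient
    return np.column_stack([np.ones(X.shape[0]), X])

def phi_piecewise_linear(x, cutoffs):
    # example 4: intercept, the raw feature, and one hinge column per cutoff
    return np.column_stack(
        [np.ones_like(x), x] + [np.maximum(0.0, x - c) for c in cutoffs]
    )

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=200)
Y = np.abs(x) + 0.1 * rng.standard_normal(200)  # kinked target, representable in this basis

X_tilde = phi_piecewise_linear(x, cutoffs=[0.0])
beta = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ Y)
mse = np.mean((X_tilde @ beta - Y) ** 2)
```

Because \(|x| = -x + 2\max(0, x)\), the piecewise linear basis can represent the target exactly, so the fitted average squared error should come down to roughly the noise variance.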
Moral

Linear regression is flexible and expressive, but you must engineer the features yourself.