Lab 2 Exercise - Linear Regression Challenge

Author

Erez Buchweitz/Lucas Schwengber/Josh Davis

You are given three files: X_train.npy, X_test.npy and y_train.npy.

You may download and read them like this:

import numpy as np
import urllib.request

# Directory where the files will be saved; replace "." with your own path.
path = "."

# Download the three files.
urllib.request.urlretrieve('https://stat154.berkeley.edu/spring-2026/datasets/linreg/X_train.npy', f'{path}/X_train.npy')
urllib.request.urlretrieve('https://stat154.berkeley.edu/spring-2026/datasets/linreg/X_test.npy', f'{path}/X_test.npy')
urllib.request.urlretrieve('https://stat154.berkeley.edu/spring-2026/datasets/linreg/y_train.npy', f'{path}/y_train.npy')

# Load them as numpy arrays.
X_train = np.load(f"{path}/X_train.npy")
X_test = np.load(f"{path}/X_test.npy")
y_train = np.load(f"{path}/y_train.npy")

There are two data sets, (X_train, y_train) and (X_test, y_test), which are independent and identically distributed. However, I have hidden y_test from you. Your task is to make the best predictions of y_test possible based on the training data and on X_test.
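
As a starting point, one possible baseline is ordinary least squares with an intercept. The sketch below is a minimal illustration, not a prescribed solution. The helper names fit_ols and predict_ols are made up for this example, and the synthetic arrays stand in for the real data under the assumption that X is two-dimensional and y is one-dimensional.

```python
import numpy as np

def fit_ols(X, y):
    """Return least-squares coefficients for y ~ [1, X] (intercept first)."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict_ols(beta, X):
    """Apply coefficients returned by fit_ols to new rows."""
    X1 = np.column_stack([np.ones(X.shape[0]), X])
    return X1 @ beta

# Synthetic stand-in data (assumption: not the course data).
rng = np.random.default_rng(0)
X_demo_train = rng.normal(size=(200, 3))
X_demo_test = rng.normal(size=(50, 3))
true_beta = np.array([1.0, 2.0, -1.0, 0.5])  # intercept, then three slopes
y_demo_train = (true_beta[0] + X_demo_train @ true_beta[1:]
                + 0.1 * rng.normal(size=200))

beta_hat = fit_ols(X_demo_train, y_demo_train)
pred_demo = predict_ols(beta_hat, X_demo_test)
print(pred_demo.shape)  # (50,)
```

With the real data you would call fit_ols(X_train, y_train) and then predict_ols(beta_hat, X_test), provided the loaded arrays have the shapes assumed here.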

1 How to submit predictions

You will submit a binary .npy file containing a one-dimensional numpy array of predictions of y_test, whose length equals the number of rows in X_test. You can save an array to a file like this:

import numpy as np

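# pred is your one-dimensional array of predictions.
# Replace {codename} with your group's codename.
np.save("pred_{codename}.npy", pred)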

The name of the file should be "pred_{codename}.npy", where {codename} is a secret codename of your group’s choosing that you can use later to identify your submission on the leaderboard. Submit the file, along with your code, to Gradescope.
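
Before submitting, it may help to check that the array has the required shape. The sketch below is one way to do that. The function save_predictions, its parameters, and the codename "examplename" are hypothetical, and the check for NaN or infinite values is an extra sanity check rather than a stated requirement.

```python
import os
import tempfile
import numpy as np

def save_predictions(pred, n_test_rows, codename, directory="."):
    """Validate the shape of pred, save it as pred_{codename}.npy, return the path."""
    pred = np.asarray(pred, dtype=float)
    if pred.ndim != 1:
        raise ValueError(f"expected a 1-D array, got shape {pred.shape}")
    if pred.shape[0] != n_test_rows:
        raise ValueError(f"expected {n_test_rows} predictions, got {pred.shape[0]}")
    if not np.all(np.isfinite(pred)):
        raise ValueError("predictions contain NaN or infinite values")
    out_path = os.path.join(directory, f"pred_{codename}.npy")
    np.save(out_path, pred)
    return out_path

# Demo with placeholder values, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = save_predictions(np.zeros(5), n_test_rows=5,
                                 codename="examplename", directory=tmp)
    reloaded = np.load(demo_path)          # read the file back to confirm it loads
    saved_name = os.path.basename(demo_path)

print(saved_name, reloaded.shape)
```

For a real submission you would pass your own predictions and n_test_rows=X_test.shape[0].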

2 How your predictions will be scored

Your predictions will be compared against the real y_test using average squared error:

\[ \text{your loss} = \frac{1}{N}\sum_{n=1}^N (y_n - \text{pred}_n)^2. \]
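
Since y_test is hidden, you cannot compute this loss on the test set yourself. One common workaround, sketched below, is to hold out part of the training data and compute the same loss there. The names mean_squared_error and holdout_split, the 20% holdout fraction, and the seed are illustrative choices, not part of the assignment.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared error, matching the loss formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def holdout_split(n_rows, holdout_fraction=0.2, seed=0):
    """Return (fit_idx, holdout_idx), a random partition of range(n_rows)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_holdout = int(round(holdout_fraction * n_rows))
    return perm[n_holdout:], perm[:n_holdout]

# Small check by hand: the squared errors are 0, 0 and 4, so the loss is 4/3.
demo_loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(demo_loss)

# Partition 10 row indices into 8 for fitting and 2 for evaluation.
fit_idx, holdout_idx = holdout_split(10)
print(len(fit_idx), len(holdout_idx))  # 8 2
```

You would fit on X_train[fit_idx] and y_train[fit_idx], then evaluate the loss on the held-out rows. The result is only an estimate of the test loss and will vary with the split.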

A leaderboard will be published with every group’s codename alongside its loss on the test set, ranked from low to high. No other identifying information will be shared, so no other student will be able to know your rank on the leaderboard. The leaderboard is for your own benefit! The group with the lowest test loss is the winner.

3 Allowed methods to use

You are permitted to use only methods that you have learned in the lectures or labs, or that are adjacent to them. This includes, for example, ordinary least squares and feature engineering. Any use of significantly more advanced learning algorithms may result in disqualification and a grade of zero.
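
To illustrate what feature engineering with ordinary least squares can look like, the sketch below appends squared columns and compares the held-out loss with and without them. The data are synthetic and built to have a quadratic effect, so the improvement here says nothing about the real data. The helpers add_squares and ols_holdout_loss are hypothetical names.

```python
import numpy as np

def add_squares(X):
    """Append the elementwise square of every column to X."""
    return np.column_stack([X, X ** 2])

def ols_holdout_loss(X_fit, y_fit, X_val, y_val):
    """Fit OLS with an intercept on the fit rows; return the average
    squared error on the validation rows."""
    A_fit = np.column_stack([np.ones(len(X_fit)), X_fit])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    beta, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    return float(np.mean((y_val - A_val @ beta) ** 2))

# Synthetic data (assumption: not the course data) with a quadratic term.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 1.0 + X[:, 0] + 2.0 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

# First 200 rows for fitting, last 100 for validation.
X_fit, X_val = X[:200], X[200:]
y_fit, y_val = y[:200], y[200:]

loss_raw = ols_holdout_loss(X_fit, y_fit, X_val, y_val)
loss_sq = ols_holdout_loss(add_squares(X_fit), y_fit, add_squares(X_val), y_val)
print(f"raw features: {loss_raw:.3f}, with squares: {loss_sq:.3f}")
```

The model is still linear in its coefficients, so this remains ordinary least squares. Whether any particular transformation helps on the real data is something to check on held-out rows.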

4 How you will be graded

A submission reflecting honest effort will receive full credit. Your rank on the leaderboard will not factor into your grade in any way.

5 Timeline

The challenge will last two consecutive lab sessions, starting with the lab session on Feb 2nd and ending at the end of the lab session on Feb 9th. You will submit predictions twice, with the following deadlines:

  • One hour after the end of the lab session on Feb 2nd.
  • One hour after the end of the lab session on Feb 9th.

A leaderboard will be published after each submission deadline and also on the morning of Friday Feb 6th as a checkpoint.

You are not required to work on the challenge outside lab sessions, but doing so is encouraged. This is for your benefit! The more time you spend working with data, the more you will get out of this course.

6 GOOD LUCK AND HAVE FUN!