import numpy as np
X_train = np.load(r"../../datasets/linreg/X_train.npy")
y_train = np.load(r"../../datasets/linreg/y_train.npy")
X_train.shape, y_train.shape((8000, 10), (8000,))
Hard? Entertaining? Stressful? Fun?
What did you learn?
What prevented you from showing your true self?
What are you going to do better next time?
Noise in a linear model can come in (at least) three forms:
Inherent unpredictability: \[ y = \beta_1 x_1 + \varepsilon \]
Missing features: \[ y = \beta_1 x_1 + \beta_2 x_2 \]
Model misspecification: \[ y = f(x_1) \]
Comments:
The data in the first competition had only the third kind of noise. It was possible to reach zero loss.
\[ \mathbb{E}_{\vphantom{}}\left[(y-\hat{y})^2\right] = \text{Var}(y) + (\mathbb{E}_{\vphantom{}}\left[y\right] - \mathbb{E}_{\vphantom{}}\left[\hat{y}\right])^2 + \text{Var}(\hat{y}), \] where \((\boldsymbol{x},y)\) is a test observation, and \(\hat{y}=\hat{y}(x)\) is a prediction from a model trained on an independent training set.
It was commonly thought that bias and variance play off of each other in the following way:
(U-shaped plot where x axis is model complexity, y axis is expected test error)
The game is to control both bias and variance in order to find the sweet spot in the middle which obtains least error.
Deep learning (arguably) extends the line further right, to extremely high complexity, and the curve drops again (this is called double descent). This regime went overlooked before. Now we say that the U-shape is called the underparameterized regime, compared to the overparameterized regime to its right.
Two archtypical methods:
Parsimonious approach
Start from a low-complexity model with high bias and low variance, and start increasing complexity. In terms of complexity of function classes, you use a small function class.
Examples:
Expressive + regularization approach
Start from a high-complexity model with low bias and high variance, and use regularization to control complexity. In terms of complexity function classes, you use a (too) big function class, but you look through it restrictively.
Examples:
The two approaches were widely represented in the solutions you submitted:
import numpy as np
X_train = np.load(r"../../datasets/linreg/X_train.npy")
y_train = np.load(r"../../datasets/linreg/y_train.npy")
X_train.shape, y_train.shape((8000, 10), (8000,))
Start with ordinary least squares in the raw features (including intercept).
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True).fit(X_train, y_train)
pred = model.predict(X_train)
def squared_error(y, pred):
error = (y - pred) ** 2
return { "mse": float(np.mean(error)), "se": np.std(error) / np.sqrt(len(error)) }
squared_error(y_train, pred){'mse': 9.782288433235111, 'se': 0.2165719811995329}
Question: Is this loss good or bad?
Question: Is our model overfit or underfit? How can we tell?
We split the training data into train and validation sets. In the future we’ll learn a better method – cross-validation.
from sklearn.model_selection import train_test_split
Xt, Xv, yt, yv = train_test_split(X_train, y_train, test_size=0.25)
Xt.shape, Xv.shape, yt.shape, yv.shape((6000, 10), (2000, 10), (6000,), (2000,))
Now we train on (Xt, yt) and predict on both (Xt, yt) and (Xv, yv).
Question: What do we expect the results to be?
model = LinearRegression(fit_intercept=True).fit(Xt, yt)
predt = model.predict(Xt)
predv = model.predict(Xv)
squared_error(yt, predt), squared_error(yv, predv)({'mse': 9.847348928100097, 'se': 0.2556322288948828},
{'mse': 9.595956082319695, 'se': 0.3993419257171409})
Question: What are your takeaways?
So let’s start making the model more expressive.
First, here’s quantile_plot again.
def quantile_plot(x, y, by=None, bins=10, by_bins=3, y_fn=np.mean):
assert len(x) == len(y)
def qp_data(x, y):
fac = np.searchsorted(np.quantile(x, q=[i / bins for i in range(1, bins)]), x)
ufac = np.unique(fac)
qx = np.array([np.mean(x[fac == f]) for f in ufac])
qy = np.array([y_fn(y[fac == f]) for f in ufac])
return qx, qy
qx, qy = qp_data(x, y)
if by is None:
plt.plot(qx, qy, "-o")
else:
assert len(x) == len(by)
plt.plot(qx, qy, "-o", label="ALL", color="lightgrey")
by_fac = np.searchsorted(np.quantile(by, q=[i / by_bins for i in range(1, by_bins)]), by)
by_ufac = np.unique(by_fac)
for i, f in enumerate(np.unique(by_ufac)):
mask = by_fac == f
nm = f"{i}) {min(by[mask]):.2f} / {max(by[mask]):.2f}"
qx, qy = qp_data(x[mask], y[mask])
plt.plot(qx, qy, "-o", label=nm)
plt.legend()Now let’s look at all the plots.
from matplotlib import pyplot as plt
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()
for i in range(Xt.shape[1]):
plt.sca(axes[i])
quantile_plot(Xt[:, i], yt)
axes[i].set_title(i)
plt.tight_layout()
plt.show()
Question: What do you see?
0, 4, 5, 7 look fairly linear, despite some noise / irregularities1, 8 don’t look very informative - they’re noisy and with small amplitudeQuestion: Why does the amplitude matter?
2, 6 look informative but not linear3, 9 look like they might be informative but they’re oddQuestion: How do you engineer features 2 and 6?
x2^3 and e^x6, which are reasonable guessesLet’s first see quantile plots after x^3 and e^x transformations:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
plt.sca(axes[0])
quantile_plot(Xt[:, 2] ** 3, yt)
axes[0].set_title("2 cubed")
plt.sca(axes[1])
quantile_plot(np.exp(Xt[:, 6]), yt)
axes[1].set_title("6 exp")
plt.tight_layout()
plt.show()
They’re fairly linear now.
Let’s try splining them instead.
Let’s compare three models:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import Pipeline
# base model
predictions = {
"base": predv
}
# manually engineered model
def engineer(X):
X_new = X.copy()
X_new[:, 2] = X_new[:, 2] ** 3
X_new[:, 6] = np.exp(X_new[:, 6])
return X_new
predictions["manual_x2x6"] = LinearRegression(fit_intercept=True).fit(engineer(Xt), yt).predict(engineer(Xv))
# automatically engineered model using splines
spline = SplineTransformer(degree=3, n_knots=8, knots="quantile", extrapolation="linear", include_bias=False)
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [2, 6]),
("pass", "passthrough", [0, 1, 3, 4, 5, 7, 8, 9])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_x2x6"] = pipeline.fit(Xt, yt).predict(Xv)
print(f"number of features in spline model: {pipeline.named_steps['ols'].coef_.shape[0]}")
# compute error
def compute_error_all_predictions():
return {nm: squared_error(yv, pred) for nm, pred in predictions.items()}
compute_error_all_predictions()number of features in spline model: 26
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792}}
Splines are only slightly worse than manual transformations (with no regularization).
Now try splining all “nonlinear” features:
# spline all nonlinear features
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
("pass", "passthrough", [1, 7, 8])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_all"] = pipeline.fit(Xt, yt).predict(Xv)
print(f"number of features in spline model: {pipeline.named_steps['ols'].coef_.shape[0]}")
compute_error_all_predictions()number of features in spline model: 66
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673}}
Let’s look at all 10 x 10 interactions:
fig, axes = plt.subplots(10, 10, figsize=(75, 50))
for i in range(Xt.shape[1]):
for j in range(Xt.shape[1]):
plt.sca(axes[i, j])
quantile_plot(Xt[:, i], yt, by=Xt[:, j])
axes[i, j].set_title(f"{i} by {j}")
plt.show()
Question: Which pairs of features have interactions?
Seems like: - 0 by 5, 5 by 0 - 1 by 8, 8 by 1
Other possibility (which I know is incorrect but we will test it): - 1 by 0, 0 by 1
Let’s study each in turn.
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 8])
axes[0].set_title("1 by 8")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt, by=Xt[:, 1])
axes[1].set_title("8 by 1")
plt.show()
Question: What do you see?
f symmetric in y ~ f(x1, x8)by groupx1 increases with x8, and vice versaQuestion: Which f can cause plots like this?
f(x1, x8) = x1 * x8 (possibly * beta).x1 is x8 (and vice versa)Question: What subsequent tests can we do to further support this hypothesis?
Question: What do you expect to see if the hypothesis is correct?
by bin, the relationship should be linearbyfig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 8], by_bins=10)
axes[0].set_title("1 by 8")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt, by=Xt[:, 1], by_bins=10)
axes[1].set_title("8 by 1")
plt.show()
Question: What do you see?
fig, axes = plt.subplots(1, 3, figsize=(22.5, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by_bins=10)
axes[0].set_title("x1")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt)
axes[1].set_title("x8")
plt.sca(axes[2])
quantile_plot(Xt[:, 1] * Xt[:, 8], yt)
axes[2].set_title("x1 x x8")
plt.show()
Look at the VERY BIG aplitude! This was an important feature.
def engineer_18(X):
X_new = (X[:, 1] * X[:, 8]).reshape(-1, 1)
return np.hstack([X, X_new])
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
("pass", "passthrough", [1, 7, 8, 10])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18"] = pipeline.fit(engineer_18(Xt), yt).predict(engineer_18(Xv))
compute_error_all_predictions(){'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
'spline_int18_05_linear': {'mse': 1.1357325964525216,
'se': 0.07024366004927746},
'spline_int18_05': {'mse': 0.13793116343316109, 'se': 0.0073451236483432275},
'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 5])
axes[0].set_title("0 by 5")
plt.sca(axes[1])
quantile_plot(Xt[:, 5], yt, by=Xt[:, 0])
axes[1].set_title("5 by 0")
plt.show()
Question: What do you see?
f asymmetric in y ~ f(x1, x8)x0 crosses 0Question: Which f can cause plots like this?
x5 and y seems to change sign when x0 crosses 0fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 5], by_bins=10)
axes[0].set_title("0 by 5")
plt.sca(axes[1])
quantile_plot(Xt[:, 5], yt, by=Xt[:, 0], by_bins=10)
axes[1].set_title("5 by 0")
plt.show()
Question: What do you see?
Right-hand side plot: - There seems to be a group of x0 < 0 in which all by_bins are similar, linear and decreasing, and a group of x0 > 0 in which the all by_bins are similar, linear and increasing. - Perhaps learn the two slopes: x5 * I(x0 > 0) and x5 * I(x0 <= 0)
Left-hand side plot: - Generally fits the hypothesis from the right-hand side plot
Here is a first more naive guess that the relationship is x0*x5
def engineer_18_05(X):
X_new = np.array([
X[:, 1] * X[:, 8],
X[:, 5] * X[:, 0],
]).T
return np.hstack([X, X_new])
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
("pass", "passthrough", [1, 7, 8, 10, 11])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18_05_linear"] = pipeline.fit(engineer_18_05(Xt), yt).predict(engineer_18_05(Xv))
compute_error_all_predictions(){'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
'spline_int18_05_linear': {'mse': 1.1357325964525216,
'se': 0.07024366004927746},
'spline_int18_05': {'mse': 0.13793116343316109, 'se': 0.0073451236483432275},
'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}
Here is our better guess on the interaction
def engineer_18_05(X):
X_new = np.array([
X[:, 1] * X[:, 8],
X[:, 5] * (X[:, 0] > 0),
X[:, 5] * (X[:, 0] <= 0)
]).T
return np.hstack([X, X_new])
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
("pass", "passthrough", [1, 7, 8, 10, 11, 12])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18_05"] = pipeline.fit(engineer_18_05(Xt), yt).predict(engineer_18_05(Xv))
compute_error_all_predictions(){'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
'spline_int18_05_linear': {'mse': 1.1357325964525216,
'se': 0.07024366004927746},
'spline_int18_05': {'mse': 0.13793116343316167, 'se': 0.007345123648343288},
'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}
Let’s compare to train error again.
predt = pipeline.predict(engineer_18_05(Xt))
squared_error(yt, predt){'mse': 0.13803364861610007, 'se': 0.004191258457428897}
Apparently not much.
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 1], by_bins=3)
axes[0].set_title("0 by 1")
plt.sca(axes[1])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 0], by_bins=3)
axes[1].set_title("1 by 0")
plt.show()
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 1], by_bins=10)
axes[0].set_title("0 by 1")
plt.sca(axes[1])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 0], by_bins=10)
axes[1].set_title("1 by 0")
plt.show()
This interaction seems complicated. Here is a guess we can make: x1 is a polynomial with different coefficients depending on x0. How can we encode this?
Here is a starting guess: f(x1,x0) = x0 * x1 + x0 * x1^2 + x0 * x1^3 + x0 * x1^3
def engineer_01(X):
X_new = np.array([
X[:, 1] * X[:, 0],
X[:, 1]**2 * X[:, 0],
X[:, 1]**3 * X[:, 0]
]).T
return np.hstack([X, X_new])
pipeline = Pipeline([
("spline", ColumnTransformer([
("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
("pass", "passthrough", [1, 7, 8, 10, 11, 12])])),
("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int_01"] = pipeline.fit(engineer_01(Xt), yt).predict(engineer_01(Xv))
compute_error_all_predictions(){'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
'spline_int18_05_linear': {'mse': 1.1357325964525216,
'se': 0.07024366004927746},
'spline_int18_05': {'mse': 0.13793116343316167, 'se': 0.007345123648343288},
'spline_int_01': {'mse': 7.397779730432853, 'se': 0.37204391881342086}}
We see from the results above that the decrease in the test error is very close to the one we had with only the splines. In particular, it is not outside the standard deviations bands we have for that previous error. This suggests that this interaction is not really there.
We could not find the exact code which generated the runner ups 24080110_feb9_final.npy file. However, we were able to look at their code to get a general idea of what tried.
1. Generate a lot of features and then;
2. Prune using ridge regression.
X)X^2)sign(X) * sqrt(|X|))log(1 + |X|))1 / (1 + |X|))tanh(X))X / (1 + |X|))arctan(X))sin(X), cos(X))sin(2X), cos(2X))max(0, X - t) and max(0, t - X)t ∈ {-2, -1, 0, 1, 2}exp(-0.5 * ((X - c) / s)^2)c are centers and s are scalesXX[:, i] * X[:, j]log1p(|X_i|) * X_jtanh(X_i) * X_jZ = X @ R, where R is a random matrixThere was too much code to paste into this document but their submission was
MSE: 0.5260523711603559
Question: What did the runner-up got right and wrong?
Final remark: automatically generating features may work well on datasets with little noise