Lab 4 Presentation - First Challenge Round-Up

Author

Erez Buchweitz/Josh Davis/Lucas Schwengber

1 How was the challenge for you?

Hard? Entertaining? Stressful? Fun?

What did you learn?

What prevented you from showing your true self?

What are you going to do better next time?

2 Was there noise in the data?

Noise in a linear model can come in (at least) three forms:

  1. Inherent unpredictability: \[ y = \beta_1 x_1 + \varepsilon \]

  2. Missing features: \[ y = \beta_1 x_1 + \beta_2 x_2 \]

  3. Model misspecification: \[ y = f(x_1) \]

Comments:

  • Examples of the innovative sources of information: President Trump’s tweets, or pizza delivery traffic near Pentagon
  • The first and second kind are virtually indistinguishable. Some statisticians even claim that the first kind does not even exist

The data in the first competition had only the third kind of noise. It was possible to reach zero loss.

3 Bias-variance decomposition

\[ \mathbb{E}_{\vphantom{}}\left[(y-\hat{y})^2\right] = \text{Var}(y) + (\mathbb{E}_{\vphantom{}}\left[y\right] - \mathbb{E}_{\vphantom{}}\left[\hat{y}\right])^2 + \text{Var}(\hat{y}), \] where \((\boldsymbol{x},y)\) is a test observation, and \(\hat{y}=\hat{y}(x)\) is a prediction from a model trained on an independent training set.

  • The first component is inaccurately referred to as the irreducible error. It would be accurate to say that it is irreducible given the features we currently have, but we may always acquire new features.
  • The second component is the bias.
  • The third component is the variance.

3.1 Bias-variance tradeoff

It was commonly thought that bias and variance play off of each other in the following way:

(U-shaped plot where x axis is model complexity, y axis is expected test error)

  • Low complexity –> high bias, low variance
  • High complexity –> low bias, high variance

The game is to control both bias and variance in order to find the sweet spot in the middle which obtains least error.

  • Most commonly used models (notable exception: deep learning) play off on this tradeoff. They offer different kinds of tradeoff curves that you can minimize
  • Statisticians and computer scientists used to think this was the entire game, but then deep learning came

Deep learning (arguably) extends the line further right, to extremely high complexity, and the curve drops again (this is called double descent). This regime went overlooked before. Now we say that the U-shape is called the underparameterized regime, compared to the overparameterized regime to its right.

  • We won’t focus on overparameterized learning this semester
  • The bias-variance decomposition is a mathematical fact, it remains true in the overparameterized regime. But the interpretation of trading off bias and variance to find the sweet spot is restricted to the underparameterized regime, which is pretty much any other model beside deep learning. And those are widely used in practice in industry and academia

3.2 Controlling bias and variance in the underparameterized regime

Two archtypical methods:

  1. Parsimonious approach

    Start from a low-complexity model with high bias and low variance, and start increasing complexity. In terms of complexity of function classes, you use a small function class.

    Examples:

    • Linear regression with manual feature engineering (like some of you did)
    • Gradient boosting
  2. Expressive + regularization approach

    Start from a high-complexity model with low bias and high variance, and use regularization to control complexity. In terms of complexity function classes, you use a (too) big function class, but you look through it restrictively.

    Examples:

    • Linear regression with many automatically generated transformations (like some of you did)
    • Random forests

4 Back to the challenge

The two approaches were widely represented in the solutions you submitted:

  • Parsimonious - start with OLS on the original raw features, engineer new features manually via exploration, adding only a handful of features to the final model
  • Expressive - start by immediately adding tons of polynomial, spline or other features, tone down using ridge or lass0

4.1 Load data

import numpy as np

X_train = np.load(r"../../datasets/linreg/X_train.npy")
y_train = np.load(r"../../datasets/linreg/y_train.npy")

X_train.shape, y_train.shape
((8000, 10), (8000,))

5 Baseline model

Start with ordinary least squares in the raw features (including intercept).

from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True).fit(X_train, y_train)
pred = model.predict(X_train)

def squared_error(y, pred):
    error = (y - pred) ** 2
    return { "mse": float(np.mean(error)), "se": np.std(error) / np.sqrt(len(error)) }

squared_error(y_train, pred)
{'mse': 9.782288433235111, 'se': 0.2165719811995329}

Question: Is this loss good or bad?

Question: Is our model overfit or underfit? How can we tell?

  • Split into train and validation, compare losses on train vs validation
  • Try some regularization

6 Train test split

We split the training data into train and validation sets. In the future we’ll learn a better method – cross-validation.

from sklearn.model_selection import train_test_split

Xt, Xv, yt, yv = train_test_split(X_train, y_train, test_size=0.25)

Xt.shape, Xv.shape, yt.shape, yv.shape
((6000, 10), (2000, 10), (6000,), (2000,))

Now we train on (Xt, yt) and predict on both (Xt, yt) and (Xv, yv).

Question: What do we expect the results to be?

model = LinearRegression(fit_intercept=True).fit(Xt, yt)
predt = model.predict(Xt)
predv = model.predict(Xv)

squared_error(yt, predt), squared_error(yv, predv)
({'mse': 9.847348928100097, 'se': 0.2556322288948828},
 {'mse': 9.595956082319695, 'se': 0.3993419257171409})

Question: What are your takeaways?

  • The loss on the training set is not that much lower than on the validation set, overfit is mild
  • Can probably increase expressivity before overfitting becomes a problem

7 Feature engineering

So let’s start making the model more expressive.

First, here’s quantile_plot again.

def quantile_plot(x, y, by=None, bins=10, by_bins=3, y_fn=np.mean):
    assert len(x) == len(y)

    def qp_data(x, y):
        fac = np.searchsorted(np.quantile(x, q=[i / bins for i in range(1, bins)]), x)
        ufac = np.unique(fac)
        qx = np.array([np.mean(x[fac == f]) for f in ufac])
        qy = np.array([y_fn(y[fac == f]) for f in ufac])
        return qx, qy

    qx, qy = qp_data(x, y)
    if by is None:
        plt.plot(qx, qy, "-o")
    else:
        assert len(x) == len(by)
        plt.plot(qx, qy, "-o", label="ALL", color="lightgrey")
        by_fac = np.searchsorted(np.quantile(by, q=[i / by_bins for i in range(1, by_bins)]), by)
        by_ufac = np.unique(by_fac)
        for i, f in enumerate(np.unique(by_ufac)):
            mask = by_fac == f
            nm = f"{i}) {min(by[mask]):.2f} / {max(by[mask]):.2f}"
            qx, qy = qp_data(x[mask], y[mask])
            plt.plot(qx, qy, "-o", label=nm)
        plt.legend()

Now let’s look at all the plots.

from matplotlib import pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()
for i in range(Xt.shape[1]):
    plt.sca(axes[i])
    quantile_plot(Xt[:, i], yt)
    axes[i].set_title(i)
plt.tight_layout()
plt.show()

Question: What do you see?

  • Features 0, 4, 5, 7 look fairly linear, despite some noise / irregularities
  • Features 1, 8 don’t look very informative - they’re noisy and with small amplitude

Question: Why does the amplitude matter?

  • Features 2, 6 look informative but not linear
  • Features 3, 9 look like they might be informative but they’re odd

8 Manual engineering

Question: How do you engineer features 2 and 6?

  • Many of you guessed x2^3 and e^x6, which are reasonable guesses
  • If you think the shape is believable and just needs to be approximated well, you can try an expressive transformation like splines

Let’s first see quantile plots after x^3 and e^x transformations:

fig, axes = plt.subplots(1, 2, figsize=(15, 6))
plt.sca(axes[0])
quantile_plot(Xt[:, 2] ** 3, yt)
axes[0].set_title("2 cubed")
plt.sca(axes[1])
quantile_plot(np.exp(Xt[:, 6]), yt)
axes[1].set_title("6 exp")
plt.tight_layout()
plt.show()

They’re fairly linear now.

9 Splines

Let’s try splining them instead.

Let’s compare three models:

  • Baseline (raw features with intercept)
  • With x2 and x6 manually engineered
  • With x2 and x6 automatically engineered with splines
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import Pipeline

# base model
predictions = {
    "base": predv
}

# manually engineered model
def engineer(X):
    X_new = X.copy()
    X_new[:, 2] = X_new[:, 2] ** 3
    X_new[:, 6] = np.exp(X_new[:, 6])
    return X_new
predictions["manual_x2x6"] = LinearRegression(fit_intercept=True).fit(engineer(Xt), yt).predict(engineer(Xv))

# automatically engineered model using splines
spline = SplineTransformer(degree=3, n_knots=8, knots="quantile", extrapolation="linear", include_bias=False)
pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [2, 6]),
        ("pass", "passthrough", [0, 1, 3, 4, 5, 7, 8, 9])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_x2x6"] = pipeline.fit(Xt, yt).predict(Xv)

print(f"number of features in spline model: {pipeline.named_steps['ols'].coef_.shape[0]}")

# compute error
def compute_error_all_predictions():
    return {nm: squared_error(yv, pred) for nm, pred in predictions.items()}
compute_error_all_predictions()
number of features in spline model: 26
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792}}

Splines are only slightly worse than manual transformations (with no regularization).

10 What do splines do?

11 Splining more features

Now try splining all “nonlinear” features:

# spline all nonlinear features
pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
        ("pass", "passthrough", [1, 7, 8])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_all"] = pipeline.fit(Xt, yt).predict(Xv)

print(f"number of features in spline model: {pipeline.named_steps['ols'].coef_.shape[0]}")

compute_error_all_predictions()
number of features in spline model: 66
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
 'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673}}

12 Interactions

Let’s look at all 10 x 10 interactions:

fig, axes = plt.subplots(10, 10, figsize=(75, 50))
for i in range(Xt.shape[1]):
    for j in range(Xt.shape[1]):
        plt.sca(axes[i, j])
        quantile_plot(Xt[:, i], yt, by=Xt[:, j])
        axes[i, j].set_title(f"{i} by {j}")
plt.show()

Question: Which pairs of features have interactions?

Seems like: - 0 by 5, 5 by 0 - 1 by 8, 8 by 1

Other possibility (which I know is incorrect but we will test it): - 1 by 0, 0 by 1

Let’s study each in turn.

13 Interaction between 1 and 8

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 8])
axes[0].set_title("1 by 8")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt, by=Xt[:, 1])
axes[1].set_title("8 by 1")
plt.show()

Question: What do you see?

  • Interaction seems to be symmetric (pretty much) -> look for f symmetric in y ~ f(x1, x8)
  • Lines pretty much linear within each by group
  • Slope of x1 increases with x8, and vice versa

Question: Which f can cause plots like this?

  • The answer appears to be f(x1, x8) = x1 * x8 (possibly * beta).
  • Think of it as a random effect - the coefficient of x1 is x8 (and vice versa)

Question: What subsequent tests can we do to further support this hypothesis?

14 Susbsequent test 1 – add more by bins

Question: What do you expect to see if the hypothesis is correct?

  • Within each by bin, the relationship should be linear
  • The slopes should increase with by
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 8], by_bins=10)
axes[0].set_title("1 by 8")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt, by=Xt[:, 1], by_bins=10)
axes[1].set_title("8 by 1")
plt.show()

Question: What do you see?

  • It’s more noisy, but that’s expected because each bin contains fewer observations.
  • It checks out.

15 Subsequent test 2 – check in quantile plot

fig, axes = plt.subplots(1, 3, figsize=(22.5, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 1], yt, by_bins=10)
axes[0].set_title("x1")
plt.sca(axes[1])
quantile_plot(Xt[:, 8], yt)
axes[1].set_title("x8")
plt.sca(axes[2])
quantile_plot(Xt[:, 1] * Xt[:, 8], yt)
axes[2].set_title("x1 x x8")
plt.show()

Look at the VERY BIG aplitude! This was an important feature.

16 Train model with interaction 1-8

def engineer_18(X):
    X_new = (X[:, 1] * X[:, 8]).reshape(-1, 1)
    return np.hstack([X, X_new])

pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
        ("pass", "passthrough", [1, 7, 8, 10])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18"] = pipeline.fit(engineer_18(Xt), yt).predict(engineer_18(Xv))

compute_error_all_predictions()
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
 'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
 'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
 'spline_int18_05_linear': {'mse': 1.1357325964525216,
  'se': 0.07024366004927746},
 'spline_int18_05': {'mse': 0.13793116343316109, 'se': 0.0073451236483432275},
 'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}

17 Interaction between 0 and 5

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 5])
axes[0].set_title("0 by 5")
plt.sca(axes[1])
quantile_plot(Xt[:, 5], yt, by=Xt[:, 0])
axes[1].set_title("5 by 0")
plt.show()

Question: What do you see?

  • Interaction seems to be asymmetric -> look for f asymmetric in y ~ f(x1, x8)
  • Something changes abruptly when x0 crosses 0

Question: Which f can cause plots like this?

  • It does not seem to be very easy to tell.
  • The correlation between x5 and y seems to change sign when x0 crosses 0
  • Let’s add by bins and see if that helps
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 5], by_bins=10)
axes[0].set_title("0 by 5")
plt.sca(axes[1])
quantile_plot(Xt[:, 5], yt, by=Xt[:, 0], by_bins=10)
axes[1].set_title("5 by 0")
plt.show()

Question: What do you see?

Right-hand side plot: - There seems to be a group of x0 < 0 in which all by_bins are similar, linear and decreasing, and a group of x0 > 0 in which the all by_bins are similar, linear and increasing. - Perhaps learn the two slopes: x5 * I(x0 > 0) and x5 * I(x0 <= 0)

Left-hand side plot: - Generally fits the hypothesis from the right-hand side plot

18 Train model interaction 0-5

Here is a first more naive guess that the relationship is x0*x5

def engineer_18_05(X):
    X_new = np.array([
        X[:, 1] * X[:, 8],
        X[:, 5] * X[:, 0],
    ]).T
    return np.hstack([X, X_new])

pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
        ("pass", "passthrough", [1, 7, 8, 10, 11])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18_05_linear"] = pipeline.fit(engineer_18_05(Xt), yt).predict(engineer_18_05(Xv))

compute_error_all_predictions()
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
 'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
 'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
 'spline_int18_05_linear': {'mse': 1.1357325964525216,
  'se': 0.07024366004927746},
 'spline_int18_05': {'mse': 0.13793116343316109, 'se': 0.0073451236483432275},
 'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}

Here is our better guess on the interaction

def engineer_18_05(X):
    X_new = np.array([
        X[:, 1] * X[:, 8],
        X[:, 5] * (X[:, 0] > 0),
        X[:, 5] * (X[:, 0] <= 0)
    ]).T
    return np.hstack([X, X_new])

pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
        ("pass", "passthrough", [1, 7, 8, 10, 11, 12])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int18_05"] = pipeline.fit(engineer_18_05(Xt), yt).predict(engineer_18_05(Xv))

compute_error_all_predictions()
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
 'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
 'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
 'spline_int18_05_linear': {'mse': 1.1357325964525216,
  'se': 0.07024366004927746},
 'spline_int18_05': {'mse': 0.13793116343316167, 'se': 0.007345123648343288},
 'spline_int_01': {'mse': 7.410697986992731, 'se': 0.37307490009933925}}

19 Is there overfitting?

Let’s compare to train error again.

predt = pipeline.predict(engineer_18_05(Xt))
squared_error(yt, predt)
{'mse': 0.13803364861610007, 'se': 0.004191258457428897}

Apparently not much.

20 Interaction 0 and 1

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 1], by_bins=3)
axes[0].set_title("0 by 1")
plt.sca(axes[1])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 0], by_bins=3)
axes[1].set_title("1 by 0")
plt.show()

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
plt.sca(axes[0])
quantile_plot(Xt[:, 0], yt, by=Xt[:, 1], by_bins=10)
axes[0].set_title("0 by 1")
plt.sca(axes[1])
quantile_plot(Xt[:, 1], yt, by=Xt[:, 0], by_bins=10)
axes[1].set_title("1 by 0")
plt.show()

This interaction seems complicated. Here is a guess we can make: x1 is a polynomial with different coefficients depending on x0. How can we encode this?

Here is a starting guess: f(x1,x0) = x0 * x1 + x0 * x1^2 + x0 * x1^3 + x0 * x1^3

def engineer_01(X):
    X_new = np.array([
        X[:, 1] * X[:, 0],
        X[:, 1]**2 * X[:, 0],
        X[:, 1]**3 * X[:, 0]
    ]).T
    return np.hstack([X, X_new])

pipeline = Pipeline([
    ("spline", ColumnTransformer([
        ("spline", spline, [0, 2, 3, 4, 5, 6, 9]),
        ("pass", "passthrough", [1, 7, 8, 10, 11, 12])])),
    ("ols", LinearRegression(fit_intercept=True))
])
predictions["spline_int_01"] = pipeline.fit(engineer_01(Xt), yt).predict(engineer_01(Xv))

compute_error_all_predictions()
{'base': {'mse': 9.595956082319695, 'se': 0.3993419257171409},
 'manual_x2x6': {'mse': 8.827973276451052, 'se': 0.3793705573366164},
 'spline_x2x6': {'mse': 8.819681283292375, 'se': 0.3796052459181792},
 'spline_all': {'mse': 7.442751406616368, 'se': 0.37486914828513673},
 'spline_int18': {'mse': 3.034177239838048, 'se': 0.19362113044516666},
 'spline_int18_05_linear': {'mse': 1.1357325964525216,
  'se': 0.07024366004927746},
 'spline_int18_05': {'mse': 0.13793116343316167, 'se': 0.007345123648343288},
 'spline_int_01': {'mse': 7.397779730432853, 'se': 0.37204391881342086}}

We see from the results above that the decrease in the test error is very close to the one we had with only the splines. In particular, it is not outside the standard deviations bands we have for that previous error. This suggests that this interaction is not really there.

21 Runner-up

We could not find the exact code which generated the runner ups 24080110_feb9_final.npy file. However, we were able to look at their code to get a general idea of what tried.

21.1 In summary, their strategy was to

1. Generate a lot of features and then; 
2. Prune using ridge regression.

21.2 In detail, they used the following features

21.2.1 Base features

  • Original features (X)
  • Polynomial features (X^2)
  • Square root of absolute values (sign(X) * sqrt(|X|))
  • Logarithmic transformations (log(1 + |X|))
  • Stable reciprocal transformations (1 / (1 + |X|))

21.2.2 Saturating transforms

  • Hyperbolic tangent (tanh(X))
  • Softsign (X / (1 + |X|))
  • Arctangent (arctan(X))

21.2.3 Trigonometric features

  • Sine and cosine of the features (sin(X), cos(X))
  • Higher-order trigonometric terms (sin(2X), cos(2X))

21.2.4 Hinge features

  • Hinge functions:
    • max(0, X - t) and max(0, t - X)
    • thresholds t ∈ {-2, -1, 0, 1, 2}

21.2.5 Radial basis functions (RBFs)

  • Gaussian-like features centered at specific values:
    • exp(-0.5 * ((X - c) / s)^2)
    • where c are centers and s are scales

21.2.6 Global statistics

  • Row-wise: norms, max, mean, standard deviation, and sum of squares of each row in X

21.2.7 Pairwise interactions

  • Elementwise products of selected pairs of features:
    • X[:, i] * X[:, j]

21.2.8 Nonlinear interactions

  • Limited combinations of nonlinear transformations of feature pairs, such as:
    • log1p(|X_i|) * X_j
    • tanh(X_i) * X_j

21.2.9 Random projections

  • Randomly projected features:
    • Z = X @ R, where R is a random matrix
  • Includes interactions and polynomial terms of the projections

There was too much code to paste into this document but their submission was

MSE: 0.5260523711603559

Question: What did the runner-up got right and wrong?

Final remark: automatically generating features may work well on datasets with little noise