import numpy as np
X_train = np.load(r"..\datasets\competition_1_linear_regression\X_train.npy")
y_train = np.load(r"..\datasets\competition_1_linear_regression\y_train.npy")
X_test = np.load(r"..\datasets\competition_1_linear_regression\X_test.npy")
y_test = np.load(r"..\datasets\competition_1_linear_regression\y_test.npy")
def squared_error(y, pred):
    err = (y - pred) ** 2
    return {
        "avg": np.mean(err),
        "se": np.std(err) / np.sqrt(err.shape[0])
    }
Lab 4 Presentation - First Competition Round-Up
1 How was the competition for you?
Hard? Entertaining? Stressful? Fun?
What did you learn?
What prevented you from showing your true self?
What are you going to do better next time?
2 Where we’re at
Regression:
- OLS, ridge, lasso, elastic net (Unit 1)
- Random forests, gradient boosting (Unit 6)
- (Deep learning)
- Kernel methods (Unit 5)
Classification (Unit 3)
Unsupervised learning (Unit 4)
More topics:
- Complexity - bias-variance decomposition, VC dimension (Unit 2)
- Cross validation (Unit 2)
- Computation (Unit 6)
- (Conformal prediction)
3 Was there noise in the data?
Noise in a linear model can come in (at least) three forms:
- Inherent unpredictability: \[ y = \beta_1 x_1 + \varepsilon \]
- Missing features: \[ y = \beta_1 x_1 + \beta_2 x_2, \] where \(x_2\) is not observed
- Model misspecification: \[ y = f(x_1), \] where \(f\) is nonlinear but we fit a linear model
Comments:
- Examples of innovative sources of information: President Trump's tweets, or pizza-delivery traffic near the Pentagon
- The first and second kinds are virtually indistinguishable. Some statisticians even claim that the first kind does not exist.
The data in the first competition had only the third kind of noise, so it was possible to reach zero loss.
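To make the distinction concrete, here is a small simulation sketch of my own (illustrative data, not the competition data): all three forms look like "noise" to a model that only uses \(x_1\), even though only the first is genuinely random.
rng = np.random.default_rng(0)  # made-up data for illustration
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)  # a feature we pretend not to observe

y_inherent = 2 * x1 + rng.normal(size=n)  # irreducible randomness
y_missing = 2 * x1 + 1.5 * x2             # deterministic, but x2 is unobserved
y_misspec = np.sin(3 * x1)                # deterministic in x1, but not linear

for name, y in [("inherent", y_inherent), ("missing feature", y_missing), ("misspecified", y_misspec)]:
    beta = np.polyfit(x1, y, 1)           # fit a straight line in x1 only
    resid = y - np.polyval(beta, x1)
    print(f"{name}: residual variance {np.var(resid):.2f}")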
4 Bias-variance decomposition
Recall from the lecture: \[ \mathbb{E}\left[\left(y-\hat f(\boldsymbol{x})\right)^2 \,\middle|\, \boldsymbol{x}\right] = \text{Var}\left(y \mid \boldsymbol{x}\right) + \left(\mathbb{E}\left[y \mid \boldsymbol{x}\right] - \mathbb{E}\left[\hat f(\boldsymbol{x}) \mid \boldsymbol{x}\right]\right)^2 + \text{Var}\left(\hat f(\boldsymbol{x}) \mid \boldsymbol{x}\right), \] where \(\hat f\) (the model trained on the training set) is independent of the test observation \((y, \boldsymbol{x})\).
- The first component is commonly, if somewhat inaccurately, referred to as the irreducible error. It would be more accurate to say that it is irreducible given the features we currently have, since we may always acquire new features.
- The second component is the squared bias.
- The third component is the variance.
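For completeness, here is a short derivation sketch, using only the stated independence of \(\hat f\) and the test observation. Decompose the error into three terms: \[ y - \hat f(\boldsymbol{x}) = \left(y - \mathbb{E}\left[y \mid \boldsymbol{x}\right]\right) + \left(\mathbb{E}\left[y \mid \boldsymbol{x}\right] - \mathbb{E}\left[\hat f(\boldsymbol{x}) \mid \boldsymbol{x}\right]\right) + \left(\mathbb{E}\left[\hat f(\boldsymbol{x}) \mid \boldsymbol{x}\right] - \hat f(\boldsymbol{x})\right). \] Squaring and taking \(\mathbb{E}\left[\,\cdot \mid \boldsymbol{x}\right]\), each cross term vanishes (one of its factors has conditional mean zero and is independent of, or constant given, the other factor), leaving exactly \(\text{Var}(y \mid \boldsymbol{x})\), the squared bias, and \(\text{Var}(\hat f(\boldsymbol{x}) \mid \boldsymbol{x})\).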
4.1 Bias-variance tradeoff
It was commonly thought that bias and variance play off of each other in the following way:
(U-shaped plot where x axis is model complexity, y axis is expected test error)
- Low complexity –> high bias, low variance
- High complexity –> low bias, high variance
The game is to control both bias and variance in order to find the sweet spot in the middle that attains the lowest error.
- Most commonly used models (notable exception: deep learning) operate on this tradeoff. They offer different kinds of tradeoff curves whose minimum you can search for
- Statisticians and computer scientists used to think this was the entire game, but then deep learning came along
Deep learning extends the curve further right, to extremely high complexity, where the error drops again (this is called double descent). This regime was previously overlooked. We now call the U-shaped part the underparameterized regime, as opposed to the overparameterized regime to its right.
- Since we decided not to teach deep learning this semester, we will not focus on overparameterized learning
- The bias-variance decomposition is a mathematical fact; it remains true in the overparameterized regime. But the interpretation of trading off bias and variance to find the sweet spot is restricted to the underparameterized regime, which covers pretty much every model besides deep learning. And those models are widely used in practice, in industry and academia
4.2 Controlling bias and variance in the underparameterized regime
Two archetypal methods:
Parsimonious approach
Start from a low-complexity model with high bias and low variance, and gradually increase complexity. In terms of function classes, you use a small function class.
Example: gradient boosting
Expressive + regularization approach
Start from a high-complexity model with low bias and high variance, and use regularization to control complexity. In terms of function classes, you use a (too) big function class, but you search through it restrictively.
Example: random forest
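A minimal sketch of the two archetypes on this competition's data; the hyperparameter values here are my own illustrative choices, not tuned, and the comparison is only meant to show the two mindsets.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Parsimonious: shallow trees added one at a time; complexity grows with n_estimators
gb = GradientBoostingRegressor(max_depth=2, n_estimators=100, learning_rate=0.1)

# Expressive + regularization: deep trees (low bias, high variance), averaged to control variance
rf = RandomForestRegressor(n_estimators=200, random_state=0)

for name, model in [("gradient boosting", gb), ("random forest", rf)]:
    model.fit(X_train, y_train)
    print(name, squared_error(y_test, model.predict(X_test)))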
5 Back to the first competition
The two approaches were widely represented in the solutions you submitted:
- Parsimonious - start with OLS on the original raw features, engineer new features manually via exploration, adding only a handful of features to the final model
- Expressive - start by immediately adding tons of polynomial or spline features, then tone them down using ridge or lasso
The two top scoring teams:
- One used the expressive approach
- One used a hybrid between the two approaches - engineer new features manually via exploration, then on top of it add many transformations
We'll now go over each of the two winning approaches, and then we'll go over how a parsimonious modeler would approach the data.
5.1 Load data
6 Expressive approach - splines and lasso
from sklearn.preprocessing import SplineTransformer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Lasso
model_expressive = make_pipeline(
    PolynomialFeatures(degree=2),
    SplineTransformer(n_knots=10),
    StandardScaler(),
    Lasso(alpha=0.01)
)

model_expressive.fit(X_train, y_train)
pred_expressive = model_expressive.predict(X_test)

squared_error(y_test, pred_expressive)
{'avg': 0.982071777758838, 'se': 0.031203585216418275}
This team:
- Made zeroth-, first- and second-degree polynomials of the raw features (including interactions)
- On top of those features added third-degree splines with ten knots
- Standardized the features to have mean zero and standard deviation one, and the target to have mean zero
6.1 Number of parameters
coef = model_expressive.named_steps["lasso"].coef_
print(f"num features: {coef.shape[0]}")
print(f"num nonzero features: {np.sum(coef != 0)}")
num features: 792
num nonzero features: 231
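Assuming the raw design matrix has 10 columns (features X0 through X9, as in the exploration later on), the 792 checks out: PolynomialFeatures(degree=2) maps 10 features to 1 + 10 + 55 = 66 columns (bias, linear terms, squares and interactions), and SplineTransformer with its default degree=3 and n_knots=10 produces n_knots + degree - 1 = 12 spline basis functions per column, giving 66 × 12 = 792.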
This approach requires serious hyperparameter optimization. The hyperparameters include at least the following:
- Polynomial degree
- Spline degree
- Spline knots
- Lasso penalty
Using arbitrary values for these hyperparameters will lead to very bad loss. This team managed to find a combination of these parameters that led to very low loss.
- The first three parameters increase complexity
- The fourth parameter tones down complexity
Therefore there is an interplay between the parameters, and they have to be tuned together; a sketch of joint tuning over the whole pipeline follows.
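A hedged sketch of how one could tune all four together with cross-validation (my illustration, not the team's actual tuning code; the grid values are placeholders):
from sklearn.model_selection import GridSearchCV

pipe = make_pipeline(
    PolynomialFeatures(),
    SplineTransformer(),
    StandardScaler(),
    Lasso(max_iter=10_000)
)

# Step names produced by make_pipeline are the lowercased class names
param_grid = {
    "polynomialfeatures__degree": [1, 2],
    "splinetransformer__degree": [2, 3],
    "splinetransformer__n_knots": [5, 10],
    "lasso__alpha": 10.0 ** np.arange(-3, 0),
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)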
6.2 Lasso penalty
lasso_alphas = 10**np.linspace(-5, 5, 20)
losses = []
num_nonzero = []
for alpha in lasso_alphas:
    model = make_pipeline(
        PolynomialFeatures(degree=2),
        SplineTransformer(degree=3, n_knots=10),
        StandardScaler(),
        Lasso(alpha=alpha)
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    losses.append(squared_error(y_test, pred)["avg"])
    coef = model.named_steps["lasso"].coef_
    num_nonzero.append(np.sum(coef != 0))
(sklearn raised ConvergenceWarning for the smallest penalty values: the lasso objective did not converge.)
from matplotlib import pyplot as plt
plt.plot(lasso_alphas, losses)
plt.xscale("log")
plt.show()

plt.plot(lasso_alphas, num_nonzero)
plt.xscale("log")
The first plot shows the loss as a function of the (log-scaled) lasso penalty. We can see the sweet spot is around 0.01.
The second plot shows the number of nonzero coefficients. We can see that it drops to zero when the penalty is large enough.
6.3 Interplay between expressiveness and regularization
lasso_alphas = 10**np.linspace(-5, -1, 10)
spline_knots = [2, 5, 7, 10]
losses = []
for alpha in lasso_alphas:
    for knots in spline_knots:
        model = make_pipeline(
            PolynomialFeatures(degree=2),
            SplineTransformer(n_knots=knots),
            StandardScaler(),
            Lasso(alpha=alpha)
        )
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        losses.append((alpha, knots, squared_error(y_test, pred)["avg"]))

import pandas as pd
losses_df = pd.DataFrame(losses)
losses_df.columns = ["alpha", "knots", "loss"]
(sklearn again raised ConvergenceWarning for the smaller alpha values.)
import seaborn as sns
sns.lineplot(data=losses_df[losses_df.alpha <= 0.5], x="alpha", y="loss", hue="knots", marker="o")
plt.xscale("log")

# Labels and title
plt.xlabel("Alpha")
plt.ylabel("Loss")
plt.title("Loss vs Alpha for Different Knots")
plt.legend(title="Knots")
plt.grid(True)
plt.show()
7 Hybrid approach - manually adding many features and ridge
from sklearn.linear_model import RidgeCV
def engineer_data(X):
    hand = np.column_stack([
        (X[:, 3] < 0).astype(int),
        ((0 <= X[:, 3]) & (X[:, 3] < 1)).astype(int),
        (X[:, 3] >= 1).astype(int),
    ])
    interactions = np.array([
        X[:, i] * X[:, j]
        for i in range(X.shape[1])
        for j in range(X.shape[1])
    ]).T
    cubes = np.apply_along_axis(lambda x: x**3, 0, X)
    exps = np.apply_along_axis(np.exp, 0, X)
    sqrts = np.apply_along_axis(lambda x: np.sqrt(np.abs(x)), 0, X)
    return np.hstack([X, hand, interactions, cubes, exps, sqrts])

engineered_X_train, engineered_X_test = map(engineer_data, (X_train, X_test))

model_parsimonious = RidgeCV(alphas=10**np.linspace(-1, 1, 20), store_cv_results=True)
model_parsimonious.fit(engineered_X_train, y_train)
pred_parsimonious = model_parsimonious.predict(engineered_X_test)

squared_error(y_test, pred_parsimonious)
{'avg': 1.0316155775477391, 'se': 0.026829800534337613}
This team:
- Made a one-hot representation for a particular feature (X3)
- Added all second-degree polynomials of the raw features (including interactions)
- Added cubic, exponential and square-root transformations of each raw feature
- Fit ridge with the penalty selected by cross validation
The loss is:
- Not as good as the other team's
- But the two scores are within two standard errors of each other, so it's hard to say which approach is actually better (a rough check follows)
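A back-of-the-envelope check using the printed scores (treating the two estimates as independent, which they are not quite, since they share the same test set):
diff = 1.0316155775477391 - 0.982071777758838
se_diff = np.sqrt(0.026829800534337613**2 + 0.031203585216418275**2)
print(diff / se_diff)  # roughly 1.2 standard errors apart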
7.1 Number of features
print(f"num features: {engineered_X_train.shape[1]}")
num features: 143
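Assuming the raw design matrix has 10 columns, the 143 decomposes as 10 raw features + 3 one-hot columns + 10 × 10 interaction products + 10 cubes + 10 exponentials + 10 square roots = 143.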
7.2 Ridge penalty
from sklearn.linear_model import Ridge
ridge_alphas = 10**np.linspace(-5, 2, 20)
losses = []
for alpha in ridge_alphas:
    model = Ridge(alpha=alpha)
    model.fit(engineered_X_train, y_train)
    pred = model.predict(engineered_X_test)
    losses.append(squared_error(y_test, pred)["avg"])

plt.plot(ridge_alphas, losses)
plt.xscale("log")
7.3 What if lasso was used?
lasso_alphas = 10**np.linspace(-5, -2, 20)
losses = []
for alpha in lasso_alphas:
    model = Lasso(alpha=alpha)
    model.fit(engineered_X_train, y_train)
    pred = model.predict(engineered_X_test)
    losses.append(squared_error(y_test, pred)["avg"])

plt.plot(lasso_alphas, losses)
plt.xscale("log")
(sklearn raised ConvergenceWarning for the smallest alpha values: the lasso objective did not converge.)
We can see that lasso would have slightly improved the predictions.
7.4 Deep dive into feature X3
This team transformed it into a one-hot encoding representing a step function
def quantile_plot(x, y, by=None, bins=10, by_bins=3, y_fn=np.mean):
    assert len(x) == len(y)

    def qp_data(x, y):
        fac = np.searchsorted(np.quantile(x, q=[i / bins for i in range(1, bins)]), x)
        ufac = np.unique(fac)
        qx = np.array([np.mean(x[fac == f]) for f in ufac])
        qy = np.array([y_fn(y[fac == f]) for f in ufac])
        return qx, qy

    qx, qy = qp_data(x, y)
    if by is None:
        plt.plot(qx, qy, "-o")
    else:
        assert len(x) == len(by)
        plt.plot(qx, qy, "-o", label="ALL", color="lightgrey")
        by_fac = np.searchsorted(np.quantile(by, q=[i / by_bins for i in range(1, by_bins)]), by)
        by_ufac = np.unique(by_fac)
        for i, f in enumerate(np.unique(by_ufac)):
            mask = by_fac == f
            nm = f"{i}) {min(by[mask]):.2f} / {max(by[mask]):.2f}"
            qx, qy = qp_data(x[mask], y[mask])
            plt.plot(qx, qy, "-o", label=nm)
        plt.legend()

quantile_plot(X_train[:, 3], y_train)
Is this a step function? Let's increase the number of bins, increasing granularity but also making the plot noisier because there are fewer observations in each bin.
quantile_plot(X_train[:, 3], y_train, bins=50)
- It looks like the local variability within each of the three regions is smaller than the differences between the regions
- It seems reasonable to assume at this point that there are three distinct regions with different behaviors
- Is it reasonable to assume the function is constant in each region, like this team did?
- I would say that this is at least a reasonable approximation (a quick check follows)
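A quick hedged check of that approximation (my own sketch, not the team's code): fit OLS on X3 alone, once as a linear term and once as the three-level step encoding, and compare test losses.
from sklearn.linear_model import LinearRegression

def step_encode(x3):
    # Three indicator columns, mirroring the team's one-hot encoding of X3
    return np.column_stack([x3 < 0, (0 <= x3) & (x3 < 1), x3 >= 1]).astype(int)

for name, tr, te in [
    ("linear X3", X_train[:, [3]], X_test[:, [3]]),
    ("step X3", step_encode(X_train[:, 3]), step_encode(X_test[:, 3])),
]:
    m = LinearRegression().fit(tr, y_train)
    print(name, squared_error(y_test, m.predict(te))["avg"])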
8 Ensembling
We will now average the predictions of the two teams, and voila:
pred_combined = (pred_expressive + pred_parsimonious) / 2
squared_error(y_test, pred_combined)
{'avg': 0.9016597970552396, 'se': 0.023160191674497673}
- The test loss has dropped, and it is now about three standard errors below the previous top score
- This is known as ensembling. Nowadays it's rare to find a Kaggle winner that did not use ensembling between different methods
8.1 Why does ensembling work?
Claim. If \(\hat Y_1\) and \(\hat Y_2\) are independent unbiased estimators of \(y\), then \((\hat Y_1 + \hat Y_2)/2\) is an unbiased estimator of \(y\) and \[ \text{Var}\left(\frac{\hat Y_1 + \hat Y_2}{2}\right) = \frac{\text{Var}\left(\hat Y_1\right) + \text{Var}\left(\hat Y_2\right)}{4}. \]
- Thus, if \(\max\left(\text{Var}\left(\hat Y_1\right), \text{Var}\left(\hat Y_2\right)\right) < 3\cdot \min\left(\text{Var}\left(\hat Y_1\right), \text{Var}\left(\hat Y_2\right)\right)\), then \(\text{Var}\left(\frac{\hat Y_1 + \hat Y_2}{2}\right) < \min\left(\text{Var}\left(\hat Y_1\right), \text{Var}\left(\hat Y_2\right)\right)\).
- If the variances are similar, the averaged estimator has lower variance than each individual estimator
- By the bias-variance decomposition, it achieves lower expected test loss!
- If the variances are not similar, for example if one estimator is very clean (low variance) and the other is very noisy, averaging them will contaminate the clean estimator and won't result in an improvement over the clean estimator
- In that case, you can always get a better averaged estimator than each individual estimator by averaging with unequal weights (i.e. putting less weight on the noisy one); see the short calculation below
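Here is the short calculation behind that last point. For weights \(w\) and \(1-w\), \[ \text{Var}\left(w\hat Y_1 + (1-w)\hat Y_2\right) = w^2\,\text{Var}\left(\hat Y_1\right) + (1-w)^2\,\text{Var}\left(\hat Y_2\right), \] which is minimized at \(w^* = \frac{\text{Var}\left(\hat Y_2\right)}{\text{Var}\left(\hat Y_1\right) + \text{Var}\left(\hat Y_2\right)}\) (inverse-variance weighting), with minimum value \[ \frac{\text{Var}\left(\hat Y_1\right)\text{Var}\left(\hat Y_2\right)}{\text{Var}\left(\hat Y_1\right) + \text{Var}\left(\hat Y_2\right)} \le \min\left(\text{Var}\left(\hat Y_1\right), \text{Var}\left(\hat Y_2\right)\right). \] So a suitably weighted average never has higher variance than the better of the two estimators.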
8.2 Ensembling is a method of reducing variance
An example of a model that inherently does ensembling is the random forest (a small demonstration follows).
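A small demonstration (my own sketch, with arbitrary hyperparameters): a random forest's regression prediction is just the average of its individual trees' predictions.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
tree_preds = np.stack([tree.predict(X_test) for tree in rf.estimators_])
print(np.allclose(tree_preds.mean(axis=0), rf.predict(X_test)))  # expect True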
8.3 Question: do we have any reason to believe the two predictions are independent or unbiased?
- No. But they’re using different methods, so we might guess that they are “not very biased” and “not too dependent”.
- Then we try and see if it worked, by computing the loss on a validation set (a quick diagnostic follows)
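One quick diagnostic sketch (my own, and note that here it peeks at the test set; in practice you would use a validation set): check how correlated the two teams' errors are, since highly correlated errors limit the gain from averaging.
resid_expressive = y_test - pred_expressive
resid_hybrid = y_test - pred_parsimonious
print(np.corrcoef(resid_expressive, resid_hybrid)[0, 1])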
9 Looking into the data (parsimonious approach)
9.1 Split training data into two:
Xa, ya = X_train[:6000, :], y_train[:6000]
Xb, yb = X_train[6000:, :], y_train[6000:]
def OLS(X, y):
    return np.linalg.inv(X.T @ X) @ (X.T @ y)

def transform(X, feature_maps):
    return np.array([f(X) for f in feature_maps]).T

def fit_and_evaluate(Xa, ya, Xb, yb, feature_maps):
    b = OLS(transform(Xa, feature_maps), ya)
    pred = transform(Xb, feature_maps) @ b
    return squared_error(yb, pred)["avg"]
feature_maps = [lambda X: np.ones(X.shape[0])]

fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
13.039256026455085
10 Start exploring
quantile_plot(Xa[:, 0], ya)

feature_maps.append(lambda X: X[:, 0])
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
12.69371546643529
quantile_plot(Xa[:, 1], ya)
quantile_plot(Xa[:, 2], ya)
quantile_plot(Xa[:, 2] ** 3, ya)

feature_maps.append(lambda X: X[:, 2] ** 3)
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
12.016124664896074
quantile_plot(Xa[:, 3], ya)

feature_maps.append(lambda X: X[:, 3] <= 0)
feature_maps.append(lambda X: X[:, 3] > 1)
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
11.08758042106949
quantile_plot(Xa[:, 4], ya)

feature_maps.append(lambda X: X[:, 4])
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
10.237321352723928
quantile_plot(Xa[:, 5], ya)

feature_maps.append(lambda X: X[:, 5])
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
9.517495255911795
quantile_plot(Xa[:, 6], ya)
quantile_plot(np.exp(Xa[:, 6]), ya)

feature_maps.append(lambda X: np.exp(X[:, 6]))
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
7.9719145561490095
quantile_plot(Xa[:, 7], ya)
It is unclear what the appropriate transformation is; try including the feature as is.
feature_maps.append(lambda X: X[:, 7])
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
7.958626495914741
The loss didn't improve significantly! Something's wrong. Perhaps the contribution of X7 is already captured by the previous features.
Xa_transformed = transform(Xa, feature_maps[1:-1])

for i in range(Xa_transformed.shape[1]):
    quantile_plot(Xa_transformed[:, i], ya, by=Xa[:, 7])
    plt.show()

quantile_plot(Xa_transformed[:, 4], ya, by=Xa[:, 7])
plt.show()
quantile_plot(Xa[:, 7], ya, by=Xa_transformed[:, 4])
plt.show()

feature_maps = feature_maps[:-1]
quantile_plot(Xa[:, 8], ya)
Not sure what to do with this one. Let’s leave it till the end
quantile_plot(Xa[:, 9], ya)
quantile_plot(Xa[:, 9], ya, bins=50)
quantile_plot(Xa[:, 9], ya, bins=100)

feature_maps.append(lambda X: X[:, 9])
feature_maps.append(lambda X: (X[:, 9] + 0.5) * (X[:, 9] > -0.5))
feature_maps.append(lambda X: (X[:, 9] - 0.5) * (X[:, 9] > 0.5))
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
7.318767364489894
11 Interactions
for i in range(1, Xa.shape[1]):
    quantile_plot(Xa[:, 0], ya, by=Xa[:, i])
    plt.show()

quantile_plot(Xa[:, 0], ya, by=Xa[:, 5])
plt.show()
quantile_plot(Xa[:, 5], ya, by=Xa[:, 0])
plt.show()

quantile_plot(Xa[:, 5], ya, Xa[:, 0], by_bins=10)

feature_maps.append(lambda X: X[:, 5] * (X[:, 0] > 0))
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
4.469515803224729
for i in range(Xa.shape[1]):
    quantile_plot(Xa[:, 1], ya, by=Xa[:, i])
    plt.show()

quantile_plot(Xa[:, 1], ya, Xa[:, 8])
plt.show()
quantile_plot(Xa[:, 8], ya, Xa[:, 1])

quantile_plot(Xa[:, 1], ya, Xa[:, 8], by_bins=10)
plt.show()
quantile_plot(Xa[:, 8], ya, Xa[:, 1], by_bins=10)

feature_maps.append(lambda X: X[:, 1] * X[:, 8])
fit_and_evaluate(Xa, ya, Xb, yb, feature_maps)
0.018292521300666345