import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
Lab 10 - Classification Competition Round Up
1 Load data
dir_path = r"../../datasets/diabetes/"
X = pd.read_parquet(f"{dir_path}X_train.parquet")
y = pd.read_parquet(f"{dir_path}y_train.parquet").squeeze()
X.shape, y.shape
((200000, 21), (200000,))
X_test = pd.read_parquet(f"{dir_path}X_test.parquet")
When you look at just these raw statistics from the training data, one thing should pop out to you immediately: you have massively more observations than features. This is the scenario where, if you have a prediction task, like we have here, you should throw your statistical hat out and put on an ML hat.
What is the key difference here? The key difference is that, with this ratio between features and sample points, you should try to be as data-driven as possible. What this means is that, instead of trying to explicitly encode strong "priors" (in an informal, not Bayesian sense) on what the predictor should look like (e.g. linear, linear with a hand-crafted feature, and so on), since you have this much data, you should just use the most expressive models you can and let them figure out the right "priors".
This works because you have much more data than features. Even if at the end of the day your model is merely a linear or quadratic function of the features, with this amount of data a very expressive model will be able to approximate that as accurately as needed for the prediction task. The cost of skipping manual feature engineering is not that significant, as we will see.
1.1 Checking features
{col: str(np.sort(X[col].unique()).tolist()) for col in X.columns}
{'HighBP': '[0, 1]',
'HighChol': '[0, 1]',
'CholCheck': '[0, 1]',
'BMI': '[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 98]',
'Smoker': '[0, 1]',
'Stroke': '[0, 1]',
'HeartDiseaseorAttack': '[0, 1]',
'PhysActivity': '[0, 1]',
'Fruits': '[0, 1]',
'Veggies': '[0, 1]',
'HvyAlcoholConsump': '[0, 1]',
'AnyHealthcare': '[0, 1]',
'NoDocbcCost': '[0, 1]',
'GenHlth': '[1, 2, 3, 4, 5]',
'MentHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
'PhysHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
'DiffWalk': '[0, 1]',
'Sex': '[0, 1]',
'Age': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]',
'Education': '[1, 2, 3, 4, 5, 6]',
'Income': '[1, 2, 3, 4, 5, 6, 7, 8]'}
features_binary = [col for col in X.columns if X[col].nunique() == 2]
features_numeric = [col for col in X.columns if X[col].nunique() > 2]
features_numeric, features_binary
(['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income'],
['HighBP',
'HighChol',
'CholCheck',
'Smoker',
'Stroke',
'HeartDiseaseorAttack',
'PhysActivity',
'Fruits',
'Veggies',
'HvyAlcoholConsump',
'AnyHealthcare',
'NoDocbcCost',
'DiffWalk',
'Sex'])
1.2 Class balance
np.mean(y)
0.139155
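Only about 14% of the observations are positive, so plain accuracy would be a misleading yardstick; F1, the competition metric, is more informative. As a quick reference point (this snippet is hypothetical, not part of the original run), the trivial classifier that always predicts the positive class gives an F1 floor to compare the later scores against.
# Hypothetical baseline: "always predict diabetic" has precision equal to the
# positive rate and recall 1, so F1 = 2p / (1 + p).
p = float(np.mean(y))          # ~0.139
baseline_f1 = 2 * p / (1 + p)  # ~0.244
print(baseline_f1)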
1.3 Check data types
X.dtypes
HighBP int64
HighChol int64
CholCheck int64
BMI int64
Smoker int64
Stroke int64
HeartDiseaseorAttack int64
PhysActivity int64
Fruits int64
Veggies int64
HvyAlcoholConsump int64
AnyHealthcare int64
NoDocbcCost int64
GenHlth int64
MentHlth int64
PhysHlth int64
DiffWalk int64
Sex int64
Age int64
Education int64
Income int64
dtype: object
Important pre-processing step: every column is stored as an integer, but some are really categorical while others are genuinely numeric. We need to convert the categorical ones first.
ordinal_categorical_cols = ["GenHlth", "Education", "Income"]
nominal_categorical_cols = features_binary
categorical_cols = nominal_categorical_cols + ordinal_categorical_cols
# Convert in X
for col in nominal_categorical_cols:
X[col] = X[col].astype("category")
for col in ordinal_categorical_cols:
X[col] = pd.Categorical(
X[col],
categories=sorted(X[col].dropna().unique()),
ordered=True
)
# Keep X_test aligned (recommended)
for col in nominal_categorical_cols:
X_test[col] = pd.Categorical(X_test[col], categories=X[col].cat.categories)
for col in ordinal_categorical_cols:
X_test[col] = pd.Categorical(
X_test[col],
categories=X[col].cat.categories,
ordered=True
)
# Quick check
X.dtypes
HighBP category
HighChol category
CholCheck category
BMI int64
Smoker category
Stroke category
HeartDiseaseorAttack category
PhysActivity category
Fruits category
Veggies category
HvyAlcoholConsump category
AnyHealthcare category
NoDocbcCost category
GenHlth category
MentHlth int64
PhysHlth int64
DiffWalk category
Sex category
Age int64
Education category
Income category
dtype: object
Here a crucial decision was made. This is a place in which stats and ML differ significantly. Note that GenHlth, Education and Income could in principle be treated as either ordinal or numeric. I explicitly decided to encode them as categorical. Why? Because categorical is the least informative way of encoding these variables. If you encode them as numeric or ordinal, you are implicitly assuming some form of spatial dependency. If you were in a small sample/features regime, adding this assumption explicitly would be important to compensate for your lack of samples. But here, since we have so many samples, it is better to encode the variables in the most agnostic way possible: we do not assume any relation between the different categories. Instead we let the algorithm leverage the large sample to figure out any relations that might be there. In some sense, encoding too many prior assumptions when you have this much data can actually work against you, as we will see shortly.
Think about LLMs, for example. At the very base level, you feed them tokens which are essentially one-hot encoded. Either that, or you encode the tokens using some pre-trained model, which itself was initially trained with all tokens one-hot encoded. Think about this for a moment. As humans we have a bunch of prior knowledge about which tokens should be similar to one another, and yet when training these things we completely forget about it and just feed the tokens one-hot encoded, as if each were completely unrelated to the others. The trick is that the large sample takes care of figuring out the relations needed for the task, in a way that is more effective than encoding them a priori.
Question: why didn’t I do the same thing for the other numerical variables? And more broadly, why shouldn’t I do that for an arbitrary numerical variable?
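To make the contrast concrete, here is a small hypothetical illustration (not part of the pipeline above; the variable names are made up) of the two possible encodings of GenHlth: as a number, the model inherits the assumption that the five levels sit evenly spaced on a line, while one-hot encoded they are just unrelated indicator columns and any structure between them has to be learned from the data.
# Hypothetical illustration of the two encodings discussed above.
toy = pd.Series([1, 3, 5, 2], name="GenHlth")
numeric_view = toy.astype(float)  # assumes levels 1..5 are ordered and evenly spaced
toy_cat = toy.astype(pd.CategoricalDtype(categories=[1, 2, 3, 4, 5]))
onehot_view = pd.get_dummies(toy_cat, prefix="GenHlth")  # one indicator per level, no ordering assumed
print(numeric_view.tolist())
print(onehot_view)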
features_cat = [col for col in X.columns if str(X[col].dtype) == "category"]
features_numeric = [col for col in X.columns if str(X[col].dtype) == "int64"]
features_cat, features_numeric
(['HighBP',
'HighChol',
'CholCheck',
'Smoker',
'Stroke',
'HeartDiseaseorAttack',
'PhysActivity',
'Fruits',
'Veggies',
'HvyAlcoholConsump',
'AnyHealthcare',
'NoDocbcCost',
'GenHlth',
'DiffWalk',
'Sex',
'Education',
'Income'],
['BMI', 'MentHlth', 'PhysHlth', 'Age'])
2 Model training and evaluation scaffolding
def make_cv_pairs(n_folds, seed, X=X, y=y):
    # assign each row a fold id 0..n_folds-1 by cycling, then shuffle the assignment
    fold = np.resize(np.arange(n_folds), X.shape[0])
    rng = np.random.default_rng(seed)
    rng.shuffle(fold)
    cv_pairs = [{
        "X_train": X.loc[fold != i, :],
        "y_train": y.loc[fold != i],
        "X_valid": X.loc[fold == i, :],
        "y_valid": y.loc[fold == i],
    } for i in np.unique(fold)]
    return cv_pairs
# Example usage
nfolds = 5
cv_pairs = make_cv_pairs(n_folds=nfolds, seed=42)
[(entry["X_train"].shape[0], entry["X_valid"].shape[0]) for entry in cv_pairs]
[(160000, 40000),
(160000, 40000),
(160000, 40000),
(160000, 40000),
(160000, 40000)]
from sklearn.metrics import f1_score, precision_recall_curve
def fit_predict_score(X_train, y_train, X_valid, y_valid, model, in_sample=False):
    # Fit on the training fold, pick the F1-maximizing decision threshold on
    # the training fold (so the validation fold is never peeked at), then
    # report F1 either in-sample or on the held-out fold.
    model.fit(X_train, y_train)
    thresh = find_best_threshold(y_train, predict(model, X_train))
    if in_sample:
        pred = predict(model, X_train) > thresh
        return f1_score(y_train, pred)
    else:
        pred = predict(model, X_valid) > thresh
        return f1_score(y_valid, pred)
def predict(model, X):
    # Return a continuous score: P(y=1) if the model exposes predict_proba,
    # otherwise the decision function, otherwise plain predictions.
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:,1]
    elif hasattr(model, "decision_function"):
        return model.decision_function(X)
    else:
        return model.predict(X)
def find_best_threshold(y, pred):
y_arr = np.asarray(y)
pred_arr = np.asarray(pred)
valid_mask = np.isfinite(y_arr) & np.isfinite(pred_arr)
if not np.any(valid_mask):
return 0.5
precision, recall, thresholds = precision_recall_curve(y_arr[valid_mask], pred_arr[valid_mask])
# align with thresholds (precision/recall are one element longer)
f1 = 2 * precision[:-1] * recall[:-1] / np.where(
precision[:-1] + recall[:-1] == 0, 1, precision[:-1] + recall[:-1]
)
finite_f1 = np.isfinite(f1)
if not np.any(finite_f1):
return 0.5
valid_idx = np.where(finite_f1)[0]
best_idx = valid_idx[np.argmax(f1[finite_f1])]
return thresholds[best_idx]
def cv_workflow(model, seed=42, nfolds=5, **kwargs):
    cv_pairs = make_cv_pairs(n_folds=nfolds, seed=seed)
    if nfolds == 2:
        # with nfolds=2, score only the first split (a single 50/50 train/validation pass) to save time
        scores = [fit_predict_score(model=model, **cv_pairs[0], **kwargs)]
    else:
        scores = [fit_predict_score(model=model, **entry, **kwargs) for entry in cv_pairs]
    return np.mean(scores)
2.1 Simple training pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from datetime import datetime
from sklearn.preprocessing import OneHotEncoder
predictors = {
"logistic": LogisticRegression(C=1),
"lda": LinearDiscriminantAnalysis(shrinkage="auto", solver="lsqr"),
"svm": LinearSVC(C=0.01, dual="auto")
}
def build_model(predictor):
transformer_numeric = Pipeline(steps=[
('spline', SplineTransformer(degree=3, n_knots=20)),
('scaler', StandardScaler())
])
transformer_categorical = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'))
])
preprocessor = ColumnTransformer(
transformers=[
('numeric', transformer_numeric, features_numeric),
('categorical', transformer_categorical, features_cat)
]
)
model = Pipeline([
("preprocessor", preprocessor),
("predictor", predictor),
])
return model
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=2)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 09:42:10] end
[2026-04-13 09:42:15] logistic=0.4706737896120605
[2026-04-13 09:42:19] lda=0.4628085249768343
[2026-04-13 09:42:34] svm=0.4705021544131595
[2026-04-13 09:42:34] end
The results for the same training run, if we encode as categorical only the binary variables and keep the rest numeric (still with the splines), are the following:
[2026-04-11 13:52:40] logistic=0.4611625371039416
[2026-04-11 13:52:43] lda=0.45808937405668904
[2026-04-11 13:52:47] svm=0.4554535702471137
Note how the more expressive models, like the SVM and logistic regression, get significant gains from this elementary change.
Just this raw logistic model, with no hparam tuning, already placed me in the top 30.
predictor = LogisticRegression(C=1)
model = build_model(predictor)
model.fit(X, y)
thresh = find_best_threshold(y, predict(model, X))
y_pred = predict(model, X_test) > thresh
np.save("lucas_baseline_pred.npy", y_pred)Following up on our conceptual intuition, I will completely ignore LDA and QDA because both models are less expressive and would be overly reliant on manually adding new features.
To start our evaluation of the logistic regression model, we check whether it is overfitting; that tells us whether we are lacking expressivity or lacking regularization.
predictor = LogisticRegression(C=100000)
model = build_model(predictor)
train_score = cv_workflow(model, seed=10, in_sample=True)
print(f"train f1 score: {train_score}")
test_error = cv_workflow(model, seed=10, in_sample=False)
print(f"test f1 score: {test_error}")train f1 score: 0.47144534867743404
test f1 score: 0.46981010798112
Not much overfitting going on, which means the logistic regression is not expressive enough as it is. We will tune its hparams anyway, just to see if there are any meaningful gains. Since the regularization is already weak, the results above should tell us we are pretty close to the best we can get for this combination of logistic + features.
Let us check the sampling variability of the error estimate by re-running cross-validation with different random fold assignments (seeds), to get a sense of what counts as a meaningful change in the CV error. We will assume this variability does not change meaningfully across models.
from tqdm import tqdm
scores = []
predictor = LogisticRegression(C=1, max_iter=100)
model = build_model(predictor)
for seed in tqdm(range(0, 1000, 100)):
test_error = cv_workflow(model, seed=seed, in_sample=False)
scores.append(test_error)
print(f"Std test error: {np.std(scores)}")100%|██████████| 10/10 [07:49<00:00, 46.96s/it]
Std test error: 0.0003585941661384227
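With a standard deviation of roughly 4e-4, differences in CV F1 between models much below about 0.001 are arguably within the noise of this estimate; keep that in mind when reading the hparam sweeps below.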
3 Tune hparams for logistic
I will only try to tune regularization strength and class weights.
Finding good scales for regularization:
grid_C = [1e-3, 1e-1, 1, 1e1, 1e3, 1e6]
predictors = {
C: LogisticRegression(C=C)
for C in [0.1, 1, 1e1, 1e3, 1e6]
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 09:49:01] 0.1=0.47155742969644504
[2026-04-13 09:49:06] 1=0.4716089205272232
[2026-04-13 09:49:13] 10.0=0.4714083372677535
[2026-04-13 09:49:18] 1000.0=0.47155917414621945
[2026-04-13 09:49:25] 1000000.0=0.47155917414621945
[2026-04-13 09:49:25] end
# formula from sklearn for class weight
base_class_weight = 1/(2*np.mean(y))
print(base_class_weight)
class_weight_grid = [0.75, 1, 3.6, 10]
3.5931155905285475
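As a cross-check (this snippet is hypothetical, not part of the original run), sklearn's own helper should reproduce this value: the "balanced" heuristic weights each class by n_samples / (n_classes * n_class_samples), which for the positive class reduces to 1 / (2 * mean(y)), i.e. the 3.6 in the grid. The companion weight 1/(2 - 1/cw) used below appears to be chosen so that the inverse class weights always sum to 2, matching the unweighted case at cw = 1.
# Hypothetical cross-check of the "balanced" class-weight value computed above.
from sklearn.utils.class_weight import compute_class_weight
balanced = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(balanced)  # the class-1 weight should be ~3.59, matching base_class_weight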
predictors = {
cw: LogisticRegression(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
for cw in class_weight_grid
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 09:50:51] 0.75=0.471191066306497
[2026-04-13 09:50:57] 1=0.4716089205272232
[2026-04-13 09:51:01] 3.6=0.4711612126518551
[2026-04-13 09:51:07] 10=0.46974615507423767
[2026-04-13 09:51:07] end
from itertools import product
updated_C_grid = [1e-1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]
grid = list(product(updated_C_grid, updated_weight_grid))
predictors = {
f"(C={C}, cw={cw})": LogisticRegression(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
for C, cw in grid
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 09:56:30] (C=0.1, cw=0.75)=0.4701561141414578
[2026-04-13 09:56:50] (C=0.1, cw=1)=0.4700004190191736
[2026-04-13 09:57:10] (C=0.1, cw=3.6)=0.4696269553925762
[2026-04-13 09:57:32] (C=1, cw=0.75)=0.47006408953876333
[2026-04-13 09:57:55] (C=1, cw=1)=0.46993197092635947
[2026-04-13 09:58:18] (C=1, cw=3.6)=0.4696297170608033
[2026-04-13 09:58:47] (C=10, cw=0.75)=0.4699358590786124
[2026-04-13 09:59:17] (C=10, cw=1)=0.46949377588414504
[2026-04-13 09:59:43] (C=10, cw=3.6)=0.4696041031786275
[2026-04-13 09:59:43] end
logistic_hparams = {
"C": 1,
"class_weight": {1: 0.75, 0: 1/(2-1/0.75)}
}
SVM:
grid_C = [1e-3, 1e-1, 1, 1e1, 1e3, 1e6]
predictors = {
C: LinearSVC(C=C, dual="auto")
for C in grid_C
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 10:02:51] 0.001=0.46638710690409685
[2026-04-13 10:03:38] 0.1=0.4672905743090041
[2026-04-13 10:04:28] 1=0.4674689112763988
[2026-04-13 10:05:27] 10.0=0.46750884270792076
[2026-04-13 10:06:25] 1000.0=0.46745973596271434
[2026-04-13 10:07:22] 1000000.0=0.46745973596271434
[2026-04-13 10:07:22] end
predictors = {
cw: LinearSVC(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
for cw in class_weight_grid
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 10:09:55] 0.75=0.4666714366569618
[2026-04-13 10:10:44] 1=0.4674689112763988
[2026-04-13 10:12:16] 3.6=0.4698845813182954
[2026-04-13 10:13:19] 10=0.4672258720079015
[2026-04-13 10:13:19] end
updated_C_grid = [0.1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]
grid = list(product(updated_C_grid, updated_weight_grid))
predictors = {
f"(C={C}, cw={cw})": LinearSVC(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
for C, cw in grid
}
for nm, predictor in predictors.items():
model = build_model(predictor)
score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")[2026-04-13 10:16:02] (C=0.1, cw=0.75)=0.46818788894887897
[2026-04-13 10:16:21] (C=0.1, cw=1)=0.4696046478673657
[2026-04-13 10:16:49] (C=0.1, cw=3.6)=0.47180109157064887
[2026-04-13 10:17:04] (C=1, cw=0.75)=0.4682738230611781
[2026-04-13 10:17:34] (C=1, cw=1)=0.46964503243717953
[2026-04-13 10:18:21] (C=1, cw=3.6)=0.4713025798829574
[2026-04-13 10:18:39] (C=10, cw=0.75)=0.4682605111294312
[2026-04-13 10:19:22] (C=10, cw=1)=0.46948197241340245
[2026-04-13 10:25:44] (C=10, cw=3.6)=0.47134772814680514
[2026-04-13 10:25:44] end
svc_hparams = {
"C": 0.1,
"class_weight": {1: 3.6, 0: 1/(2-1/3.6)}
}
Let us check overfitting for the best models so far, to see if we need more expressivity.
logistic = LogisticRegression(**logistic_hparams)
model = build_model(logistic)
train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 logistic: {train_score}")
test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 logistic: {test_error}")
svc = LinearSVC(**svc_hparams)
model = build_model(svc)
train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 svc: {train_score}")
test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 svc: {test_error}")train f1 logistic: 0.47138173560527696
test f1 logistic: 0.47006408953876333
train f1 svc: 0.47117195020562536
test f1 svc: 0.46997019485179575
We have very little overfitting. This suggests the methods could use more expressivity. One approach we could take now would be to manually add more features, as some of you did. That's a reasonable approach and it led to some of the best scores. I, however, opt for a more parsimonious approach in light of what I elaborated on earlier. I believe our feature representation is already expressive enough. What we need is a method that can natively explore more non-linear functions of those features. That suggests trying boosting.
4 Gradient boosting
Sklearn has two implementations of gradient boosting (GradientBoostingClassifier and the histogram-based HistGradientBoostingClassifier). I am picking the more efficient one to run so I can explore more hparams quickly.
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier()
score = cv_workflow(model, nfolds=2)
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
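A side note on that FutureWarning: its own message points at the fix, so (assuming a sklearn version where the option already exists, which the warning implies) it can likely be silenced by passing categorical_features explicitly; the category-dtype columns we created earlier are then picked up as categorical splits. The variable name below is made up for illustration.
# Hedged sketch: be explicit about how categorical columns are detected,
# using the option named in the warning message, to silence the FutureWarning.
model_explicit_cats = HistGradientBoostingClassifier(categorical_features="from_dtype")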
l2_grid = [0, 1, 10]
lr_grid = [0.01, 0.1, 1]
max_leaves_grid = [30, 100]
class_weight_grid = [0.75, 1, 2, 3.6]
full_grid = list(product(l2_grid, lr_grid, max_leaves_grid, class_weight_grid))
results = {}
for l2, lr, max_leaves, cw in full_grid:
model = HistGradientBoostingClassifier(
l2_regularization=l2,
learning_rate=lr,
max_leaf_nodes=max_leaves,
class_weight={1:cw, 0:1/(2-1/cw)}
)
score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
results[(l2, lr, max_leaves, cw)] = score
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] (l2={l2}, lr={lr}, max_leaves={max_leaves}, cw={cw})={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
sorted_results
[((1, 0.1, 30, 1), 0.47397189708954585),
((1, 0.1, 30, 3.6), 0.4739598881862838),
((1, 0.1, 30, 0.75), 0.47355342563010294),
((10, 0.1, 30, 1), 0.4734924659730116),
((10, 0.1, 30, 3.6), 0.4731454213995232),
((0, 0.1, 30, 3.6), 0.47312541473125413),
((10, 0.1, 100, 2), 0.4727006428164583),
((0, 0.1, 30, 2), 0.4725328147788041),
((0, 0.1, 100, 3.6), 0.47230571060541005),
((10, 0.1, 30, 2), 0.4721347790602775),
((1, 0.1, 30, 2), 0.4718932443703086),
((1, 0.1, 100, 3.6), 0.47136248485516025),
((0, 0.1, 30, 1), 0.4712298078056833),
((1, 0.1, 100, 2), 0.47118556553168295),
((10, 0.1, 100, 3.6), 0.47114688594840315),
((1, 0.01, 100, 0.75), 0.4706592332892695),
((0, 0.01, 100, 1), 0.47044522968197877),
((10, 0.1, 30, 0.75), 0.47032138333284235),
((10, 0.1, 100, 0.75), 0.47022146407458604),
((0, 0.1, 30, 0.75), 0.47020167003285807),
((0, 0.1, 100, 2), 0.4700665188470067),
((1, 0.01, 100, 1), 0.4699048462049126),
((0, 0.01, 100, 2), 0.46963495575221237),
((1, 0.1, 100, 1), 0.4695717074290963),
((10, 0.01, 100, 1), 0.4693586968903064),
((1, 0.1, 100, 0.75), 0.4692389758179232),
((10, 0.01, 100, 0.75), 0.4691295970511854),
((10, 0.01, 100, 2), 0.4691264857289168),
((1, 0.01, 100, 2), 0.46912533559417835),
((0, 0.01, 100, 0.75), 0.46909183413889594),
((0, 0.1, 100, 1), 0.46778086712090194),
((0, 0.01, 100, 3.6), 0.46741679622651455),
((1, 0.01, 100, 3.6), 0.4673497231213718),
((10, 0.01, 100, 3.6), 0.4670399133104569),
((1, 0.01, 30, 0.75), 0.46703837323712694),
((10, 0.01, 30, 1), 0.4665549615681268),
((10, 0.01, 30, 0.75), 0.4657234339509045),
((1, 0.01, 30, 1), 0.46565627680869315),
((0, 0.01, 30, 0.75), 0.46562022503125433),
((10, 0.1, 100, 1), 0.4653063752051274),
((1, 0.01, 30, 2), 0.46506200645844076),
((0, 0.01, 30, 1), 0.4645537497975271),
((0, 0.01, 30, 2), 0.4643511943189154),
((10, 0.01, 30, 2), 0.46387301915570145),
((0, 0.1, 100, 0.75), 0.4635676861216035),
((1, 0.01, 30, 3.6), 0.46193345559339793),
((0, 0.01, 30, 3.6), 0.4619190209608124),
((10, 0.01, 30, 3.6), 0.4615306433580648),
((10, 1, 30, 2), 0.45838148659695016),
((10, 1, 30, 3.6), 0.4577893354065575),
((1, 1, 30, 2), 0.4569364312074432),
((10, 1, 30, 0.75), 0.4559430071241095),
((1, 1, 30, 3.6), 0.45558448677538715),
((10, 1, 30, 1), 0.4555018972059331),
((0, 1, 30, 1), 0.4540675302151645),
((0, 1, 30, 3.6), 0.45280931433237626),
((1, 1, 30, 1), 0.4518628110289173),
((0, 1, 30, 2), 0.4518586628136083),
((10, 1, 100, 3.6), 0.4484416539397025),
((10, 1, 100, 2), 0.44667545360217614),
((0, 1, 100, 2), 0.44229451585496204),
((0, 1, 100, 3.6), 0.4418382806800965),
((1, 1, 100, 3.6), 0.4410955171004224),
((1, 1, 30, 0.75), 0.44021363145806236),
((0, 1, 30, 0.75), 0.4394418604651163),
((10, 1, 100, 0.75), 0.43937093934057925),
((10, 1, 100, 1), 0.43920722017330815),
((0, 1, 100, 1), 0.4386407558514065),
((1, 1, 100, 2), 0.4374098228198765),
((0, 1, 100, 0.75), 0.4219705041215724),
((1, 1, 100, 0.75), 0.4197631686402613),
((1, 1, 100, 1), 0.41947907826881653)]
Recompute the error estimates for the top candidates with more folds, to get more reliable results.
top_10 = [{"l2": key[0], "lr": key[1], "max_leaves": key[2], "cw": key[3]} for key, _ in sorted_results[:10]]
for param in top_10:
model = HistGradientBoostingClassifier(
l2_regularization=param["l2"],
learning_rate=param["lr"],
max_leaf_nodes=param["max_leaves"],
class_weight={1:param["cw"], 0:1/(2-1/param["cw"])}
)
test_error = cv_workflow(model, nfolds=10, seed=10, in_sample=False)
print(f"params: {param}")
print(f"test f1 grad boost: {test_error}")The best hparams from this second more accurate run were “l2”: 1, “lr”: 0.1, “max_leaves”: 30, “cw”: 3.6. Let us do a final overfitting sanity check.
grad_boost_hparams = {"l2": 1, "lr": 0.1, "max_leaves": 30, "cw": 3.6}
model = HistGradientBoostingClassifier(
l2_regularization=grad_boost_hparams["l2"],
learning_rate=grad_boost_hparams["lr"],
max_leaf_nodes=grad_boost_hparams["max_leaves"],
class_weight={1:grad_boost_hparams["cw"], 0:1/(2-1/grad_boost_hparams["cw"])}
)
train_score = cv_workflow(model, nfolds=10, seed=5, in_sample=True)
print(f"train f1 grad boost: {train_score}")
test_error = cv_workflow(model, nfolds=10, seed=5, in_sample=False)
print(f"test f1 grad boost: {test_error}")/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
train f1 grad boost: 0.4836779370286172
test f1 grad boost: 0.4721004154120754
model.fit(X, y)
thresh = find_best_threshold(y, predict(model, X))
y_pred = predict(model, X_test) > thresh
np.save("lucas_final_pred.npy", y_pred)/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
warnings.warn(
We are indeed overfitting now. If we wanted to try to get an even better score, we could try to reduce this overfitting a bit by thinking about principled ways of increasing regularization. But in principle our hparam sweep over the boosting parameters already did that, and larger regularization did not seem to improve the final results. What seems clear, though, is that our exploration of boosting is giving us diminishing returns at this point. There are still some tweaks we could try inside the boosting world, as it is the most advanced model we can use. We instead opt for stopping here, under the view that what we would gain from further exploration of boosting would not be significantly better than what we already have.
5 Leaderboard
| group | f1 score | rank median | rank Q1 | rank Q3 | precision | recall |
|---|---|---|---|---|---|---|
| gbjj2 | 0.4675 | 5 | 2 | 9 | 0.3892 | 0.5854 |
| brand_adc | 0.4674 | 6 | 3 | 9 | 0.3816 | 0.6029 |
| bill&richard | 0.4673 | 6 | 3 | 10 | 0.3762 | 0.6168 |
| Deni2 | 0.4673 | 7 | 3 | 11 | 0.3824 | 0.6004 |
| hello | 0.4671 | 8 | 4 | 12 | 0.3809 | 0.6036 |
| gemini_raw | 0.4666 | 12 | 7 | 16 | 0.3811 | 0.6017 |
| sourdough3 | 0.4663 | 13 | 9 | 17 | 0.3794 | 0.6048 |
| 什么时候能更新完讲义 | 0.4662 | 14 | 9 | 18 | 0.3893 | 0.5811 |
| epicfire77day2 | 0.4661 | 14 | 9 | 18 | 0.3843 | 0.5921 |
| MJ2 | 0.4661 | 14 | 9 | 19 | 0.3866 | 0.5867 |
| lucas_final_pred | 0.4660 | 15 | 11 | 18 | 0.3814 | 0.5989 |
| Dragon_Warrior_v2 | 0.4657 | 16 | 12 | 20 | 0.3797 | 0.6023 |
| lab9ez | 0.4657 | 17 | 12 | 21 | 0.3681 | 0.6337 |
| gemini_breiman | 0.4653 | 19 | 15 | 21 | 0.3778 | 0.6057 |
| 8279.2 | 0.4652 | 20 | 16 | 22 | 0.3786 | 0.6032 |
| 567 | 0.4649 | 20 | 17 | 23 | 0.3772 | 0.6056 |
| aaa123___ | 0.4648 | 21 | 18 | 24 | 0.3811 | 0.5956 |
| kdot | 0.4643 | 23 | 19 | 25 | 0.3721 | 0.6174 |
| erez | 0.4641 | 23 | 20 | 26 | 0.3754 | 0.6077 |
| Darkeee | 0.4639 | 24 | 20 | 26 | 0.3829 | 0.5886 |
| apeman27-2 | 0.4629 | 26 | 24 | 29 | 0.3709 | 0.6156 |
| zhej2 | 0.4624 | 27 | 26 | 29 | 0.3762 | 0.6000 |
| Rith_God (1) | 0.4620 | 28 | 26 | 30 | 0.3799 | 0.5896 |
| lucas_baseline_pred | 0.4616 | 29 | 27 | 31 | 0.3746 | 0.6013 |
| lemon | 0.4613 | 30 | 27 | 32 | 0.3800 | 0.5870 |
| concurrentenrollment | 0.4608 | 31 | 29 | 32 | 0.3743 | 0.5995 |
| yay2 | 0.4593 | 34 | 32 | 36 | 0.3733 | 0.5969 |
| T | 0.4592 | 34 | 32 | 36 | 0.3753 | 0.5912 |
| pollen (2) | 0.4590 | 34 | 32 | 36 | 0.3737 | 0.5948 |
| Reda | 0.4590 | 35 | 32 | 37 | 0.3652 | 0.6174 |
| Keval_Lab8 | 0.4585 | 36 | 33 | 38 | 0.3490 | 0.6683 |
| lovestat | 0.4580 | 37 | 34 | 39 | 0.3723 | 0.5951 |
| purpleapple | 0.4580 | 36 | 34 | 39 | 0.3813 | 0.5733 |
| 1001 | 0.4566 | 39 | 37 | 42 | 0.3522 | 0.6492 |
| delicious_coffee | 0.4561 | 40 | 38 | 46 | 0.3589 | 0.6255 |
| kaihong ren 1 | 0.4557 | 41 | 40 | 45 | 0.3629 | 0.6122 |
| ardak | 0.4549 | 44 | 41 | 48 | 0.3592 | 0.6204 |
| willster | 0.4548 | 44 | 42 | 48 | 0.3554 | 0.6314 |
| han2159 | 0.4547 | 45 | 42 | 49 | 0.3521 | 0.6419 |
| letsTryIt | 0.4545 | 46 | 43 | 51 | 0.3532 | 0.6374 |
| YNWA | 0.4542 | 49 | 45 | 53 | 0.3596 | 0.6164 |
| Cam | 0.4541 | 50 | 45 | 54 | 0.3554 | 0.6287 |
| bears | 0.4541 | 50 | 46 | 55 | 0.3554 | 0.6286 |
| 154_Lab_7_Submission | 0.4541 | 50 | 47 | 54 | 0.3602 | 0.6141 |
| Classicals | 0.4538 | 52 | 47 | 58 | 0.3505 | 0.6435 |
| Stat154 too hard | 0.4537 | 53 | 48 | 56 | 0.3620 | 0.6079 |
| 154lab111 | 0.4536 | 53 | 49 | 57 | 0.3627 | 0.6055 |
| sg1 | 0.4536 | 54 | 49 | 59 | 0.3503 | 0.6434 |
| my_name (2) | 0.4535 | 54 | 48 | 58 | 0.3666 | 0.5947 |
| edwards | 0.4535 | 55 | 51 | 59 | 0.3626 | 0.6053 |
| dexhome | 0.4534 | 56 | 52 | 59 | 0.3625 | 0.6053 |
| Sid | 0.4533 | 56 | 50 | 60 | 0.3687 | 0.5883 |
| plsfix | 0.4532 | 57 | 50 | 61 | 0.3688 | 0.5878 |
| Pengbo | 0.4532 | 57 | 52 | 61 | 0.3680 | 0.5898 |
| msstumble | 0.4531 | 57 | 46 | 63 | 0.3481 | 0.6490 |
| pearl | 0.4531 | 58 | 54 | 61 | 0.3590 | 0.6141 |
| Beta | 0.4530 | 58 | 51 | 62 | 0.3742 | 0.5738 |
| OJP_Final | 0.4528 | 60 | 54 | 63 | 0.3468 | 0.6522 |
| arya_deshmukh | 0.4525 | 61 | 53 | 64 | 0.3536 | 0.6281 |
| corn | 0.4494 | 65 | 65 | 65 | 0.3597 | 0.5985 |
| anon_stat234 | 0.4471 | 66 | 66 | 67 | 0.3172 | 0.7569 |
| LanaDelRey | 0.4469 | 66 | 66 | 67 | 0.3212 | 0.7345 |
| wupenghao2 | 0.4401 | 68 | 68 | 69 | 0.3055 | 0.7863 |
| vivreplm | 0.4394 | 69 | 69 | 70 | 0.3054 | 0.7828 |
| Pavel_lab9 | 0.4394 | 69 | 69 | 70 | 0.3100 | 0.7538 |
| midterm_good_luck | 0.4379 | 71 | 71 | 71 | 0.3073 | 0.7618 |
| jiver2 | 0.4336 | 72 | 72 | 72 | 0.3102 | 0.7203 |
| pd | 0.4300 | 73 | 73 | 73 | 0.2910 | 0.8234 |
| taeyoungkim | 0.4247 | 74 | 74 | 75 | 0.4221 | 0.4273 |
| cvvc_121 | 0.4239 | 75 | 74 | 75 | 0.4511 | 0.3997 |
| Radon-Nikodym-2 | 0.4114 | 76 | 76 | 76 | 0.3666 | 0.4687 |
| dp | 0.3979 | 78 | 77 | 78 | 0.3133 | 0.5453 |
| o2-4 | 0.3979 | 78 | 77 | 79 | 0.3123 | 0.5480 |
| cpv | 0.3977 | 78 | 78 | 79 | 0.3124 | 0.5472 |
| advanced_v2 | 0.3288 | 80 | 80 | 80 | 0.5020 | 0.2444 |
| pg | 0.2855 | 81 | 81 | 81 | 0.4939 | 0.2008 |
| hamster | 0.2764 | 82 | 82 | 82 | 0.5475 | 0.1848 |
| lost_and_confused2 | 0.2466 | 83 | 83 | 83 | 0.5192 | 0.1617 |
| Prediction SCDP | 0.2278 | 84 | 84 | 84 | 0.5250 | 0.1454 |
| silly | 0.1929 | 86 | 86 | 86 | 0.5551 | 0.1167 |
| tent2 | 0.1591 | 87 | 87 | 87 | 0.5740 | 0.0923 |
| Cole | 0.1366 | 88 | 88 | 88 | 0.4472 | 0.0806 |
| elle | 0.1155 | 89 | 89 | 90 | 0.5793 | 0.0641 |
| billhewang | 0.1153 | 90 | 89 | 90 | 0.5718 | 0.0641 |
| pipis | 0.0000 | 91 | 91 | 91 | 0.0000 | 0.0000 |
6 What LLMs did
Gemini raw built a somewhat weird pipeline, but its final prediction was just gradient boosting with the default parameter choices from sklearn. Nonetheless, it did one thing significantly better than I did: it explored more methods, instead of spending too much time tuning the hparams of a single method. This allowed it to find out that gradient boosting was the top performer even with its default hparam choices.
Gemini Breiman went straight for random forests and then HistGradientBoosting, like I did.
The rest was pretty much standard, like everyone else's: CV for error estimation. But the LLMs were quite lazy with hparam tuning. As the empirical results show, that was mostly fine: the biggest differences came from the methods used, not from the hparam tuning itself.
7 The tied top 5 from the class
gbjj2 did random forests and gradient boosting with a wider hparam sweep than I did. They also added some manually engineered features, including interactions. There was a final step of averaging the predictions of RF and GB, which I believe might have made the predictions worse, but the final score was high anyway, which means at least one of the two averaged models was individually very good.
Deni2 used LightGBM, but with much more background data processing and manual inspection. They manually added several interactions which they found by exploring the data.
bill and richard did some heavy manual feature engineering too and made a more careful choice of which hparams to test; they also worked with HistGradientBoosting, gradient boosting and random forests.
hello did similarly to the previous two: some heavy manual feature addition plus a very large hparam sweep with LightGBM.
brand_adc searched over a very wide class of models and a wide range of hparams. There was not that much manual engineering; it was more about scaling up the search.
Overall, all top 5 submissions were very good. They put in a lot of work and it paid off. The first four leaned heavily on adding more features, especially interactions, and greatly increased the number of degrees of freedom they were optimizing over, even when they were not entirely sure whether an addition was worthwhile.
That’s the spirit. The more degrees of freedom you have to optimize over, and the more computational power you are willing to leverage to test things, the better the results will be.
I'd just like to point out, as usual, that the gains from saturating on a narrow set of ideas can become marginal pretty quickly. Everyone at the top of the ranking essentially ended up using a variant of RF/GB with just some small tweaks: adding more features or doing larger hparam sweeps. And, as usual, the top of the ranking is quite bunched together.
If we compare gemini_breiman and gemini_raw, although gemini_raw got a slightly better result, its level of effort was significantly larger than gemini_breiman's, whose final f1_score was quite close but came from a much more parsimonious script.
It's always good to notice when over-optimizing a certain class of models starts providing diminishing returns. This is a class context, and I capped how complicated the models you could use were allowed to be, so it was worth the effort to push RF/GB to the limit. But in a broader context, when performance starts saturating like this, either it's better to call it a day if you've reached the goal, or, more crucially, to seek a major conceptual change in what you are doing, like trying methods from a very different class or making a very big change to the model.
As we will see in the last lab, one thing that makes deep learning such a rich framework is that you can formulate very different model classes within the same framework by redesigning the network architecture, and that gives you a wide range of possible ways to improve your model.
Whereas for RF/GB you only have a handful of hparams, which usually give you at best marginal gains and saturate very quickly, neural networks have a magical hparam called "architecture" which allows for such distinct variations that they have been the single major paradigm in the field for more than 10 years. But their dominance has relied on inventing new and more powerful architectures every now and then.
That said, for certain types of data, like tabular data, RF/GB is taken to be SOTA, as you indirectly found out (no one in the top 5 ended up using a kernel method, as far as I remember). Some recent approaches such as TabPFN are slowly starting to break the dominance of GB for tabular data. But that's a very recent development.