Lab 10 - Classification Competition Round-Up

Author

Lucas Schwengber / Erez Buchweitz / Josh Davis

1 Load data

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
dir_path = r"../../datasets/diabetes/"
X = pd.read_parquet(f"{dir_path}X_train.parquet")
y = pd.read_parquet(f"{dir_path}y_train.parquet").squeeze()
X.shape, y.shape
((200000, 21), (200000,))
X_test = pd.read_parquet(f"{dir_path}X_test.parquet")

When you look at just these raw statistics from the training data, one thing should pop out to you immediately: you have massively more observations than features. This is the scenario where, if you have a prediction task, like we have here, you should throw your statistical hat out and put on an ML hat.

What is the key difference here? With this ratio of features to sample points, you should try to be as data-driven as possible. Instead of explicitly encoding strong “priors” (in an informal, not Bayesian, sense) about what the predictor should look like (e.g. linear, linear with a hand-crafted feature, and so on), since you have this much data you should use the most expressive models you can and let them figure out the right “priors”.

This works because you have much more data than features. Even if at the end of the day your model is merely a linear or quadratic function of the features, with this amount of data a very expressive model will be able to approximate that as accurately as the prediction task requires. The cost of forgoing manual feature engineering is not that significant, as we will see.

1.1 Checking features

{col: str(np.sort(X[col].unique()).tolist()) for col in X.columns}
{'HighBP': '[0, 1]',
 'HighChol': '[0, 1]',
 'CholCheck': '[0, 1]',
 'BMI': '[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 98]',
 'Smoker': '[0, 1]',
 'Stroke': '[0, 1]',
 'HeartDiseaseorAttack': '[0, 1]',
 'PhysActivity': '[0, 1]',
 'Fruits': '[0, 1]',
 'Veggies': '[0, 1]',
 'HvyAlcoholConsump': '[0, 1]',
 'AnyHealthcare': '[0, 1]',
 'NoDocbcCost': '[0, 1]',
 'GenHlth': '[1, 2, 3, 4, 5]',
 'MentHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
 'PhysHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
 'DiffWalk': '[0, 1]',
 'Sex': '[0, 1]',
 'Age': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]',
 'Education': '[1, 2, 3, 4, 5, 6]',
 'Income': '[1, 2, 3, 4, 5, 6, 7, 8]'}
features_binary = [col for col in X.columns if X[col].nunique() == 2]
features_numeric = [col for col in X.columns if X[col].nunique() > 2]
features_numeric, features_binary
(['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income'],
 ['HighBP',
  'HighChol',
  'CholCheck',
  'Smoker',
  'Stroke',
  'HeartDiseaseorAttack',
  'PhysActivity',
  'Fruits',
  'Veggies',
  'HvyAlcoholConsump',
  'AnyHealthcare',
  'NoDocbcCost',
  'DiffWalk',
  'Sex'])

1.2 Class balance

np.mean(y)
0.139155
1.3 Check data types

X.dtypes
HighBP                  int64
HighChol                int64
CholCheck               int64
BMI                     int64
Smoker                  int64
Stroke                  int64
HeartDiseaseorAttack    int64
PhysActivity            int64
Fruits                  int64
Veggies                 int64
HvyAlcoholConsump       int64
AnyHealthcare           int64
NoDocbcCost             int64
GenHlth                 int64
MentHlth                int64
PhysHlth                int64
DiffWalk                int64
Sex                     int64
Age                     int64
Education               int64
Income                  int64
dtype: object

Important pre-processing step: some columns are categorical and some are numeric. We need to convert the categorical ones first.

ordinal_categorical_cols = ["GenHlth", "Education", "Income"]
nominal_categorical_cols = features_binary
categorical_cols = nominal_categorical_cols + ordinal_categorical_cols

# Convert in X
for col in nominal_categorical_cols:
    X[col] = X[col].astype("category")

for col in ordinal_categorical_cols:
    X[col] = pd.Categorical(
        X[col],
        categories=sorted(X[col].dropna().unique()),
        ordered=True
    )

# Keep X_test aligned (recommended)
for col in nominal_categorical_cols:
    X_test[col] = pd.Categorical(X_test[col], categories=X[col].cat.categories)

for col in ordinal_categorical_cols:
    X_test[col] = pd.Categorical(
        X_test[col],
        categories=X[col].cat.categories,
        ordered=True
    )

# Quick check
X.dtypes
HighBP                  category
HighChol                category
CholCheck               category
BMI                        int64
Smoker                  category
Stroke                  category
HeartDiseaseorAttack    category
PhysActivity            category
Fruits                  category
Veggies                 category
HvyAlcoholConsump       category
AnyHealthcare           category
NoDocbcCost             category
GenHlth                 category
MentHlth                   int64
PhysHlth                   int64
DiffWalk                category
Sex                     category
Age                        int64
Education               category
Income                  category
dtype: object

Here a crucial decision was made, and it is a place where stats and ML differ significantly. Note that GenHlth, Education and Income could in principle be treated as either ordinal or numeric. I explicitly decided to encode them as categorical. Why? Because categorical is the least informative way of encoding these variables. If you encode them as numeric or ordinal, you are implicitly assuming some form of spatial dependency between the levels. In a small sample/features regime, adding this assumption explicitly would be important to compensate for your lack of samples. Here, however, since we have so many samples, it is better to encode the variables in the most agnostic way possible: we assume no relation between the different categories, and instead allow the algorithm to leverage the large sample to figure out any relations that are actually there. In some sense, encoding too many prior assumptions when you have this much data can actually work against you, as we will see shortly.

Think about LLMs, for example. At the very base level, you feed them tokens that are essentially one-hot encoded, or else encoded by a pre-trained model that was itself initially trained on one-hot encoded tokens. Think about this for a moment: as humans we have plenty of prior knowledge about which tokens should be similar to one another, and yet when training these models we completely forget about it and feed the tokens one-hot encoded, as if each were completely unrelated to the others. The trick is that the large sample takes care of figuring out the relations needed for the task, more effectively than encoding them a priori.
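To make the distinction concrete, here is a minimal sketch on a toy column (not the lab's pipeline): encoded as numeric, the model sees one column whose spacing asserts that levels 1 and 5 are "far apart"; one-hot encoded, each level is a separate indicator with no assumed geometry.

```python
import pandas as pd

# toy ordinal-looking column, analogous to GenHlth
s = pd.Series([1, 3, 5, 3], name="GenHlth")

# numeric encoding: a single column; distances between levels are assumed
numeric = s.to_frame()

# categorical / one-hot encoding: one indicator per level, no assumed geometry
onehot = pd.get_dummies(s.astype("category"), prefix="GenHlth")
print(onehot.columns.tolist())
```

With enough data, the model can recover any ordering or spacing from the indicators on its own.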

Question: why didn’t I do the same thing for the other numerical variables? And more broadly, why shouldn’t I do that for an arbitrary numerical variable?

features_cat = [col for col in X.columns if str(X[col].dtype) == "category"]
features_numeric = [col for col in X.columns if str(X[col].dtype) == "int64"]
features_cat, features_numeric
(['HighBP',
  'HighChol',
  'CholCheck',
  'Smoker',
  'Stroke',
  'HeartDiseaseorAttack',
  'PhysActivity',
  'Fruits',
  'Veggies',
  'HvyAlcoholConsump',
  'AnyHealthcare',
  'NoDocbcCost',
  'GenHlth',
  'DiffWalk',
  'Sex',
  'Education',
  'Income'],
 ['BMI', 'MentHlth', 'PhysHlth', 'Age'])
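An equivalent, slightly more idiomatic way to produce this split is pandas' `select_dtypes`; a sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "HighBP": pd.Categorical([0, 1, 0]),
    "BMI": [22, 31, 27],
})

# split columns by dtype instead of inspecting each one by name
cat_cols = df.select_dtypes("category").columns.tolist()
num_cols = df.select_dtypes("number").columns.tolist()
print(cat_cols, num_cols)
```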

2 Model training and evaluation scaffolding

def make_cv_pairs(n_folds, seed, X=X, y=y):
    fold = np.resize(np.arange(n_folds), X.shape[0])
    rng = np.random.default_rng(seed)
    rng.shuffle(fold)

    cv_pairs = [{
        "X_train": X.loc[fold != i, :],
        "y_train": y.loc[fold != i],
        "X_valid": X.loc[fold == i, :],
        "y_valid": y.loc[fold == i],
    } for i in np.unique(fold)]

    return cv_pairs

# Example usage
nfolds = 5
cv_pairs = make_cv_pairs(n_folds=nfolds, seed=42)
[(entry["X_train"].shape[0], entry["X_valid"].shape[0]) for entry in cv_pairs]
[(160000, 40000),
 (160000, 40000),
 (160000, 40000),
 (160000, 40000),
 (160000, 40000)]
from sklearn.metrics import f1_score, precision_recall_curve

def fit_predict_score(X_train, y_train, X_valid, y_valid, model, in_sample=False):
    model.fit(X_train, y_train)
    thresh = find_best_threshold(y_train, predict(model, X_train))
    if in_sample:
        pred = predict(model, X_train) > thresh
        return f1_score(y_train, pred)
    else:
        pred = predict(model, X_valid) > thresh
        return f1_score(y_valid, pred)
    

def predict(model, X):
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:,1]
    elif hasattr(model, "decision_function"):
        return model.decision_function(X)
    else:
        return model.predict(X)

def find_best_threshold(y, pred):
    y_arr = np.asarray(y)
    pred_arr = np.asarray(pred)

    valid_mask = np.isfinite(y_arr) & np.isfinite(pred_arr)
    if not np.any(valid_mask):
        return 0.5

    precision, recall, thresholds = precision_recall_curve(y_arr[valid_mask], pred_arr[valid_mask])

    # align with thresholds (precision/recall are one element longer)
    f1 = 2 * precision[:-1] * recall[:-1] / np.where(
        precision[:-1] + recall[:-1] == 0, 1, precision[:-1] + recall[:-1]
    )

    finite_f1 = np.isfinite(f1)
    if not np.any(finite_f1):
        return 0.5

    valid_idx = np.where(finite_f1)[0]
    best_idx = valid_idx[np.argmax(f1[finite_f1])]
    return thresholds[best_idx]

def cv_workflow(model, seed=42, nfolds=5, **kwargs):
    cv_pairs = make_cv_pairs(n_folds=nfolds, seed=seed)
    if nfolds == 2:
        # with 2 folds, score a single 50/50 split for speed
        scores = [fit_predict_score(model=model, **cv_pairs[0], **kwargs)]
    else:
        scores = [fit_predict_score(model=model, **entry, **kwargs) for entry in cv_pairs]
    return np.mean(scores)
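The threshold search inside find_best_threshold can be seen on a toy example (a self-contained sketch; note that precision_recall_curve treats its thresholds as inclusive, i.e. a point counts as positive when its score is >= the threshold):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_toy = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_toy, scores)
# precision/recall carry one extra trailing element; drop it to align with thresholds
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best_thresh = thresholds[np.argmax(f1)]
print(best_thresh, f1.max())
```

Here the best threshold is 0.35: scoring every point with score >= 0.35 as positive yields precision 0.75 and recall 1.0, for an F1 of 6/7.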

2.1 Simple training pipeline

from sklearn.preprocessing import SplineTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from datetime import datetime
from sklearn.preprocessing import OneHotEncoder
predictors = {
    "logistic": LogisticRegression(C=1),
    "lda": LinearDiscriminantAnalysis(shrinkage="auto", solver="lsqr"),
    "svm": LinearSVC(C=0.01, dual="auto")
}

def build_model(predictor):
    transformer_numeric = Pipeline(steps=[
        ('spline', SplineTransformer(degree=3, n_knots=20)),
        ('scaler', StandardScaler())
    ])
    transformer_categorical = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', transformer_numeric, features_numeric),
            ('categorical', transformer_categorical, features_cat)
        ]
    )

    model = Pipeline([
        ("preprocessor", preprocessor),
        ("predictor", predictor),
    ])

    return model

print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] start")
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:42:10] start
[2026-04-13 09:42:15] logistic=0.4706737896120605
[2026-04-13 09:42:19] lda=0.4628085249768343
[2026-04-13 09:42:34] svm=0.4705021544131595
[2026-04-13 09:42:34] end

The results for the same training run, if we encode only the binary variables as categorical and the rest as numeric (keeping the splines), are the following:

[2026-04-11 13:52:40] logistic=0.4611625371039416
[2026-04-11 13:52:43] lda=0.45808937405668904
[2026-04-11 13:52:47] svm=0.4554535702471137

Note how the more expressive models, like SVM and logistic regression, gain significantly from this elementary change.

This raw logistic model, with no hparam tuning, already placed me in the top 30.

predictor = LogisticRegression(C=1)

model = build_model(predictor)
model.fit(X, y)
thresh = find_best_threshold(y, predict(model, X))
y_pred = predict(model, X_test) > thresh
np.save("lucas_baseline_pred.npy", y_pred)

Following up on our conceptual intuition, I will completely ignore LDA and QDA because both models are less expressive and would be overly reliant on manually adding new features.

To start our evaluation of the logistic regression model we will see whether it is overfitting or not, to see whether we are lacking expressivity or regularization.

predictor = LogisticRegression(C=100000)
model = build_model(predictor)

train_score = cv_workflow(model, seed=10, in_sample=True)
print(f"train f1 score: {train_score}")

test_error = cv_workflow(model, seed=10, in_sample=False)
print(f"test f1 score: {test_error}")
train f1 score: 0.47144534867743404
test f1 score: 0.46981010798112

Not much overfitting going on, which means the logistic regression is not expressive enough as it is. We will tune its hparams anyway, just to see if there are any meaningful gains. Since the regularization is already weak, the results above suggest we are pretty close to the best we can get for this combination of logistic + features.

Let us check the sampling variability of the CV error by re-running cross-validation with different fold seeds, to get a sense of what counts as a meaningful change in the CV score. We will assume this variability does not change much across models.

from tqdm import tqdm

scores = []
predictor = LogisticRegression(C=1, max_iter=100)
model = build_model(predictor)
for seed in tqdm(range(0, 1000, 100)):
    test_error = cv_workflow(model, seed=seed, in_sample=False)
    scores.append(test_error)

print(f"Std test error: {np.std(scores)}")
100%|██████████| 10/10 [07:49<00:00, 46.96s/it]
Std test error: 0.0003585941661384227

3 Tune hparams for logistic

I will only try to tune regularization strength and class weights.

Finding good scales for regularization:

grid_C = [1e-1, 1, 1e1, 1e3, 1e6]

predictors = {
    C: LogisticRegression(C=C)
    for C in grid_C
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:49:01] 0.1=0.47155742969644504
[2026-04-13 09:49:06] 1=0.4716089205272232
[2026-04-13 09:49:13] 10.0=0.4714083372677535
[2026-04-13 09:49:18] 1000.0=0.47155917414621945
[2026-04-13 09:49:25] 1000000.0=0.47155917414621945
[2026-04-13 09:49:25] end
# formula from sklearn for class weight
base_class_weight = 1/(2*np.mean(y)) 

print(base_class_weight)

class_weight_grid = [0.75, 1, 3.6, 10]
3.5931155905285475
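The formula above is sklearn's "balanced" heuristic, n_samples / (n_classes * n_in_class); for the positive class of a binary problem this reduces to 1/(2*mean(y)). A quick self-contained check on toy labels (the 14% positive rate mirrors the class balance above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels with 14% positives
y_toy = np.array([1] * 14 + [0] * 86)

w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
# balanced weight for class 1 equals 1 / (2 * mean(y))
print(w[1], 1 / (2 * y_toy.mean()))
```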
predictors = {
    cw: LogisticRegression(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
    for cw in class_weight_grid
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:50:51] 0.75=0.471191066306497
[2026-04-13 09:50:57] 1=0.4716089205272232
[2026-04-13 09:51:01] 3.6=0.4711612126518551
[2026-04-13 09:51:07] 10=0.46974615507423767
[2026-04-13 09:51:07] end
from itertools import product

updated_C_grid = [1e-1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]

grid = list(product(updated_C_grid, updated_weight_grid))

predictors = {
    f"(C={C}, cw={cw})": LogisticRegression(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
    for C, cw in grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:56:30] (C=0.1, cw=0.75)=0.4701561141414578
[2026-04-13 09:56:50] (C=0.1, cw=1)=0.4700004190191736
[2026-04-13 09:57:10] (C=0.1, cw=3.6)=0.4696269553925762
[2026-04-13 09:57:32] (C=1, cw=0.75)=0.47006408953876333
[2026-04-13 09:57:55] (C=1, cw=1)=0.46993197092635947
[2026-04-13 09:58:18] (C=1, cw=3.6)=0.4696297170608033
[2026-04-13 09:58:47] (C=10, cw=0.75)=0.4699358590786124
[2026-04-13 09:59:17] (C=10, cw=1)=0.46949377588414504
[2026-04-13 09:59:43] (C=10, cw=3.6)=0.4696041031786275
[2026-04-13 09:59:43] end
logistic_hparams = {
    "C": 1,
    "class_weight": {1: 0.75, 0: 1/(2-1/0.75)}
}

SVM:

grid_C = [1e-3, 1e-1, 1, 1e1, 1e3, 1e6]

predictors = {
    C: LinearSVC(C=C, dual="auto")
    for C in grid_C
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:02:51] 0.001=0.46638710690409685
[2026-04-13 10:03:38] 0.1=0.4672905743090041
[2026-04-13 10:04:28] 1=0.4674689112763988
[2026-04-13 10:05:27] 10.0=0.46750884270792076
[2026-04-13 10:06:25] 1000.0=0.46745973596271434
[2026-04-13 10:07:22] 1000000.0=0.46745973596271434
[2026-04-13 10:07:22] end
predictors = {
    cw: LinearSVC(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
    for cw in class_weight_grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:09:55] 0.75=0.4666714366569618
[2026-04-13 10:10:44] 1=0.4674689112763988
[2026-04-13 10:12:16] 3.6=0.4698845813182954
[2026-04-13 10:13:19] 10=0.4672258720079015
[2026-04-13 10:13:19] end
updated_C_grid = [0.1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]

grid = list(product(updated_C_grid, updated_weight_grid))

predictors = {
    f"(C={C}, cw={cw})": LinearSVC(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
    for C, cw in grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:16:02] (C=0.1, cw=0.75)=0.46818788894887897
[2026-04-13 10:16:21] (C=0.1, cw=1)=0.4696046478673657
[2026-04-13 10:16:49] (C=0.1, cw=3.6)=0.47180109157064887
[2026-04-13 10:17:04] (C=1, cw=0.75)=0.4682738230611781
[2026-04-13 10:17:34] (C=1, cw=1)=0.46964503243717953
[2026-04-13 10:18:21] (C=1, cw=3.6)=0.4713025798829574
[2026-04-13 10:18:39] (C=10, cw=0.75)=0.4682605111294312
[2026-04-13 10:19:22] (C=10, cw=1)=0.46948197241340245
[2026-04-13 10:25:44] (C=10, cw=3.6)=0.47134772814680514
[2026-04-13 10:25:44] end
svc_hparams = {
    "C": 0.1,
    "class_weight": {1: 3.6, 0: 1/(2-1/3.6)}
}

Let us check overfitting for the best models so far, to see whether we need more expressivity.

logistic = LogisticRegression(**logistic_hparams)
model = build_model(logistic)

train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 logistic: {train_score}")
test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 logistic: {test_error}")

svc = LinearSVC(**svc_hparams)
model = build_model(svc)

train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 svc: {train_score}")

test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 svc: {test_error}")
train f1 logistic: 0.47138173560527696
test f1 logistic: 0.47006408953876333
train f1 svc: 0.47117195020562536
test f1 svc: 0.46997019485179575

We have very little overfitting. This suggests the methods could use more expressivity. One approach we could take now would be to manually add more features, as some of you did. That is a reasonable approach and it led to some of the best scores. However, I opt for a more parsimonious approach, in light of what I elaborated on earlier: I believe our feature representation is already expressive enough. What we need is a method that can natively explore more non-linear functions of those features. That suggests trying boosting.

4 Gradient boosting

Sklearn has two implementations of gradient boosting. I am picking the more efficient one (HistGradientBoostingClassifier) so I can explore more hparams quickly.

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier()

score = cv_workflow(model, nfolds=2)
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
  warnings.warn(
l2_grid = [0, 1, 10]
lr_grid = [0.01, 0.1, 1]
max_leaves_grid = [30, 100]
class_weight_grid = [0.75, 1, 2, 3.6]

full_grid = list(product(l2_grid, lr_grid, max_leaves_grid, class_weight_grid))
results = {}
for l2, lr, max_leaves, cw in full_grid:
    model = HistGradientBoostingClassifier(
        l2_regularization=l2,
        learning_rate=lr,
        max_leaf_nodes=max_leaves,
        class_weight={1:cw, 0:1/(2-1/cw)}
    )
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    results[(l2, lr, max_leaves, cw)] = score
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] (l2={l2}, lr={lr}, max_leaves={max_leaves}, cw={cw})={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
sorted_results
[((1, 0.1, 30, 1), 0.47397189708954585),
 ((1, 0.1, 30, 3.6), 0.4739598881862838),
 ((1, 0.1, 30, 0.75), 0.47355342563010294),
 ((10, 0.1, 30, 1), 0.4734924659730116),
 ((10, 0.1, 30, 3.6), 0.4731454213995232),
 ((0, 0.1, 30, 3.6), 0.47312541473125413),
 ((10, 0.1, 100, 2), 0.4727006428164583),
 ((0, 0.1, 30, 2), 0.4725328147788041),
 ((0, 0.1, 100, 3.6), 0.47230571060541005),
 ((10, 0.1, 30, 2), 0.4721347790602775),
 ((1, 0.1, 30, 2), 0.4718932443703086),
 ((1, 0.1, 100, 3.6), 0.47136248485516025),
 ((0, 0.1, 30, 1), 0.4712298078056833),
 ((1, 0.1, 100, 2), 0.47118556553168295),
 ((10, 0.1, 100, 3.6), 0.47114688594840315),
 ((1, 0.01, 100, 0.75), 0.4706592332892695),
 ((0, 0.01, 100, 1), 0.47044522968197877),
 ((10, 0.1, 30, 0.75), 0.47032138333284235),
 ((10, 0.1, 100, 0.75), 0.47022146407458604),
 ((0, 0.1, 30, 0.75), 0.47020167003285807),
 ((0, 0.1, 100, 2), 0.4700665188470067),
 ((1, 0.01, 100, 1), 0.4699048462049126),
 ((0, 0.01, 100, 2), 0.46963495575221237),
 ((1, 0.1, 100, 1), 0.4695717074290963),
 ((10, 0.01, 100, 1), 0.4693586968903064),
 ((1, 0.1, 100, 0.75), 0.4692389758179232),
 ((10, 0.01, 100, 0.75), 0.4691295970511854),
 ((10, 0.01, 100, 2), 0.4691264857289168),
 ((1, 0.01, 100, 2), 0.46912533559417835),
 ((0, 0.01, 100, 0.75), 0.46909183413889594),
 ((0, 0.1, 100, 1), 0.46778086712090194),
 ((0, 0.01, 100, 3.6), 0.46741679622651455),
 ((1, 0.01, 100, 3.6), 0.4673497231213718),
 ((10, 0.01, 100, 3.6), 0.4670399133104569),
 ((1, 0.01, 30, 0.75), 0.46703837323712694),
 ((10, 0.01, 30, 1), 0.4665549615681268),
 ((10, 0.01, 30, 0.75), 0.4657234339509045),
 ((1, 0.01, 30, 1), 0.46565627680869315),
 ((0, 0.01, 30, 0.75), 0.46562022503125433),
 ((10, 0.1, 100, 1), 0.4653063752051274),
 ((1, 0.01, 30, 2), 0.46506200645844076),
 ((0, 0.01, 30, 1), 0.4645537497975271),
 ((0, 0.01, 30, 2), 0.4643511943189154),
 ((10, 0.01, 30, 2), 0.46387301915570145),
 ((0, 0.1, 100, 0.75), 0.4635676861216035),
 ((1, 0.01, 30, 3.6), 0.46193345559339793),
 ((0, 0.01, 30, 3.6), 0.4619190209608124),
 ((10, 0.01, 30, 3.6), 0.4615306433580648),
 ((10, 1, 30, 2), 0.45838148659695016),
 ((10, 1, 30, 3.6), 0.4577893354065575),
 ((1, 1, 30, 2), 0.4569364312074432),
 ((10, 1, 30, 0.75), 0.4559430071241095),
 ((1, 1, 30, 3.6), 0.45558448677538715),
 ((10, 1, 30, 1), 0.4555018972059331),
 ((0, 1, 30, 1), 0.4540675302151645),
 ((0, 1, 30, 3.6), 0.45280931433237626),
 ((1, 1, 30, 1), 0.4518628110289173),
 ((0, 1, 30, 2), 0.4518586628136083),
 ((10, 1, 100, 3.6), 0.4484416539397025),
 ((10, 1, 100, 2), 0.44667545360217614),
 ((0, 1, 100, 2), 0.44229451585496204),
 ((0, 1, 100, 3.6), 0.4418382806800965),
 ((1, 1, 100, 3.6), 0.4410955171004224),
 ((1, 1, 30, 0.75), 0.44021363145806236),
 ((0, 1, 30, 0.75), 0.4394418604651163),
 ((10, 1, 100, 0.75), 0.43937093934057925),
 ((10, 1, 100, 1), 0.43920722017330815),
 ((0, 1, 100, 1), 0.4386407558514065),
 ((1, 1, 100, 2), 0.4374098228198765),
 ((0, 1, 100, 0.75), 0.4219705041215724),
 ((1, 1, 100, 0.75), 0.4197631686402613),
 ((1, 1, 100, 1), 0.41947907826881653)]

Recompute the error estimates for the top candidates to get more reliable results:

top_10 = [{"l2": key[0], "lr": key[1], "max_leaves": key[2], "cw": key[3]} for key, _ in sorted_results[:10]]

for param in top_10:
    model = HistGradientBoostingClassifier(
        l2_regularization=param["l2"],
        learning_rate=param["lr"],
        max_leaf_nodes=param["max_leaves"],
        class_weight={1: param["cw"], 0: 1/(2 - 1/param["cw"])}
    )
    test_error = cv_workflow(model, nfolds=10, seed=10, in_sample=False)
    print(f"params: {param}")
    print(f"test f1 grad boost: {test_error}")

The best hparams from this second more accurate run were “l2”: 1, “lr”: 0.1, “max_leaves”: 30, “cw”: 3.6. Let us do a final overfitting sanity check.

grad_boost_hparams = {"l2": 1, "lr": 0.1, "max_leaves": 30, "cw": 3.6}

model = HistGradientBoostingClassifier(
    l2_regularization=grad_boost_hparams["l2"],
    learning_rate=grad_boost_hparams["lr"],
    max_leaf_nodes=grad_boost_hparams["max_leaves"],
    class_weight={1:grad_boost_hparams["cw"], 0:1/(2-1/grad_boost_hparams["cw"])}
)

train_score = cv_workflow(model, nfolds=10, seed=5, in_sample=True)
print(f"train f1 grad boost: {train_score}")

test_error = cv_workflow(model, nfolds=10, seed=5, in_sample=False)
print(f"test f1 grad boost: {test_error}")
train f1 grad boost: 0.4836779370286172
  warnings.warn(
test f1 grad boost: 0.4721004154120754
model.fit(X, y)                                      # refit on the full training set
thresh = find_best_threshold(y, predict(model, X))   # tune the decision threshold for F1
y_pred = predict(model, X_test) > thresh             # binarize test predictions
np.save("lucas_final_pred.npy", y_pred)
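The `find_best_threshold` helper used above is defined earlier in the lab. As a reminder of the idea, here is a minimal sketch of what such a helper might look like (the actual implementation may differ): sweep a grid of candidate thresholds and keep the one that maximizes F1 on the training predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, scores, n_grid=101):
    """Grid-search a probability threshold that maximizes F1 on (y_true, scores)."""
    thresholds = np.linspace(0.0, 1.0, n_grid)
    f1s = [f1_score(y_true, (scores > t).astype(int), zero_division=0) for t in thresholds]
    return thresholds[int(np.argmax(f1s))]

# Toy check: scores cleanly separate the two classes.
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.2, 0.8, 0.9])
t = find_best_threshold(y_toy, s_toy)
```

Note this tunes the threshold on the same data the model was fit on, which is fine here only because the threshold is a single degree of freedom.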

We are indeed overfitting. If we wanted an even better score, we could now try to reduce overfitting by thinking about principled ways of increasing regularization. But in principle our hparam sweep over the boosting parameters already did that, and larger regularization did not seem to improve the final results. What does seem clear is that our exploration of boosting is giving us diminishing returns at this point. There are still tweaks we could try within the boosting world, as it is the most advanced model we can use here, but we opt to stop, on the view that further exploration of boosting would not yield results significantly better than what we already have.

5 Leaderboard

group  f1 score  median rank  Q1 rank  Q3 rank  precision  recall
gbjj2 0.4675 5 2 9 0.3892 0.5854
brand_adc 0.4674 6 3 9 0.3816 0.6029
bill&richard 0.4673 6 3 10 0.3762 0.6168
Deni2 0.4673 7 3 11 0.3824 0.6004
hello 0.4671 8 4 12 0.3809 0.6036
gemini_raw 0.4666 12 7 16 0.3811 0.6017
sourdough3 0.4663 13 9 17 0.3794 0.6048
什么时候能更新完讲义 0.4662 14 9 18 0.3893 0.5811
epicfire77day2 0.4661 14 9 18 0.3843 0.5921
MJ2 0.4661 14 9 19 0.3866 0.5867
lucas_final_pred 0.4660 15 11 18 0.3814 0.5989
Dragon_Warrior_v2 0.4657 16 12 20 0.3797 0.6023
lab9ez 0.4657 17 12 21 0.3681 0.6337
gemini_breiman 0.4653 19 15 21 0.3778 0.6057
8279.2 0.4652 20 16 22 0.3786 0.6032
567 0.4649 20 17 23 0.3772 0.6056
aaa123___ 0.4648 21 18 24 0.3811 0.5956
kdot 0.4643 23 19 25 0.3721 0.6174
erez 0.4641 23 20 26 0.3754 0.6077
Darkeee 0.4639 24 20 26 0.3829 0.5886
apeman27-2 0.4629 26 24 29 0.3709 0.6156
zhej2 0.4624 27 26 29 0.3762 0.6000
Rith_God (1) 0.4620 28 26 30 0.3799 0.5896
lucas_baseline_pred 0.4616 29 27 31 0.3746 0.6013
lemon 0.4613 30 27 32 0.3800 0.5870
concurrentenrollment 0.4608 31 29 32 0.3743 0.5995
yay2 0.4593 34 32 36 0.3733 0.5969
T 0.4592 34 32 36 0.3753 0.5912
pollen (2) 0.4590 34 32 36 0.3737 0.5948
Reda 0.4590 35 32 37 0.3652 0.6174
Keval_Lab8 0.4585 36 33 38 0.3490 0.6683
lovestat 0.4580 37 34 39 0.3723 0.5951
purpleapple 0.4580 36 34 39 0.3813 0.5733
1001 0.4566 39 37 42 0.3522 0.6492
delicious_coffee 0.4561 40 38 46 0.3589 0.6255
kaihong ren 1 0.4557 41 40 45 0.3629 0.6122
ardak 0.4549 44 41 48 0.3592 0.6204
willster 0.4548 44 42 48 0.3554 0.6314
han2159 0.4547 45 42 49 0.3521 0.6419
letsTryIt 0.4545 46 43 51 0.3532 0.6374
YNWA 0.4542 49 45 53 0.3596 0.6164
Cam 0.4541 50 45 54 0.3554 0.6287
bears 0.4541 50 46 55 0.3554 0.6286
154_Lab_7_Submission 0.4541 50 47 54 0.3602 0.6141
Classicals 0.4538 52 47 58 0.3505 0.6435
Stat154 too hard 0.4537 53 48 56 0.3620 0.6079
154lab111 0.4536 53 49 57 0.3627 0.6055
sg1 0.4536 54 49 59 0.3503 0.6434
my_name (2) 0.4535 54 48 58 0.3666 0.5947
edwards 0.4535 55 51 59 0.3626 0.6053
dexhome 0.4534 56 52 59 0.3625 0.6053
Sid 0.4533 56 50 60 0.3687 0.5883
plsfix 0.4532 57 50 61 0.3688 0.5878
Pengbo 0.4532 57 52 61 0.3680 0.5898
msstumble 0.4531 57 46 63 0.3481 0.6490
pearl 0.4531 58 54 61 0.3590 0.6141
Beta 0.4530 58 51 62 0.3742 0.5738
OJP_Final 0.4528 60 54 63 0.3468 0.6522
arya_deshmukh 0.4525 61 53 64 0.3536 0.6281
corn 0.4494 65 65 65 0.3597 0.5985
anon_stat234 0.4471 66 66 67 0.3172 0.7569
LanaDelRey 0.4469 66 66 67 0.3212 0.7345
wupenghao2 0.4401 68 68 69 0.3055 0.7863
vivreplm 0.4394 69 69 70 0.3054 0.7828
Pavel_lab9 0.4394 69 69 70 0.3100 0.7538
midterm_good_luck 0.4379 71 71 71 0.3073 0.7618
jiver2 0.4336 72 72 72 0.3102 0.7203
pd 0.4300 73 73 73 0.2910 0.8234
taeyoungkim 0.4247 74 74 75 0.4221 0.4273
cvvc_121 0.4239 75 74 75 0.4511 0.3997
Radon-Nikodym-2 0.4114 76 76 76 0.3666 0.4687
dp 0.3979 78 77 78 0.3133 0.5453
o2-4 0.3979 78 77 79 0.3123 0.5480
cpv 0.3977 78 78 79 0.3124 0.5472
advanced_v2 0.3288 80 80 80 0.5020 0.2444
pg 0.2855 81 81 81 0.4939 0.2008
hamster 0.2764 82 82 82 0.5475 0.1848
lost_and_confused2 0.2466 83 83 83 0.5192 0.1617
Prediction SCDP 0.2278 84 84 84 0.5250 0.1454
silly 0.1929 86 86 86 0.5551 0.1167
tent2 0.1591 87 87 87 0.5740 0.0923
Cole 0.1366 88 88 88 0.4472 0.0806
elle 0.1155 89 89 90 0.5793 0.0641
billhewang 0.1153 90 89 90 0.5718 0.0641
pipis 0.0000 91 91 91 0.0000 0.0000

6 What LLMs did

Gemini raw built a somewhat convoluted pipeline, but its final prediction was just gradient boosting with the default parameter choices from the sklearn package. Nonetheless, it did one thing significantly better than I did: it explored more methods instead of spending too much time tuning the hparams of a single one. This allowed it to find out that gradient boosting was a top performer even with its default hparams.

Gemini Breiman went straight for random forests and then HistGradientBoosting, like I did.

The rest was pretty much standard, like everyone else: CV for error estimation. But the LLMs were quite lazy with hparam tuning. As the empirical results show, that was mostly fine: the biggest differences came from the methods used, not from the hparam tuning itself.

7 The tied top 5 from the class

  • gbjj2 did random forests and gradient boosting with a wider hparam sweep than I did. They also added some manually engineered features, including interactions. There was a final step of averaging the predictions of RF and GB, which I believe might have made the predictions worse, but the final score was high anyway, which means at least one of the two averaged models was individually very good.

  • Deni2 used LightGBM but with much more background data processing and manual inspection. They manually added several interactions which they found by exploring the data.

  • bill and richard did some heavy manual feature engineering too and were more careful in choosing which hparams to test; they also worked with HistGradientBoosting, gradient boosting, and random forests.

  • hello did something similar to the previous two: heavy manual feature addition plus a very large hparam sweep with LightGBM.

  • brand adc searched over a very wide class of models and a wide range of hparams. There was not that much manual engineering; it was more about scaling.

Overall, all top 5 submissions were very good. They put in a lot of work and it paid off. The first four leaned heavily on adding more features, especially interactions, greatly increasing the number of degrees of freedom to optimize over, even when they were not entirely sure whether an addition was worthwhile.

That’s the spirit. The more degrees of freedom you have to optimize over, and the more computational power you are willing to leverage to test things, the better the results will be.

I’d just like to point out, as usual, that the gains from saturating a narrow set of ideas can become marginal pretty quickly. Everyone at the top of the ranking essentially ended up using a variant of RF/GB with just some small tweaks, such as adding more features or doing larger hparam sweeps. And, as usual, the top of the ranking is quite bunched up.

If we compare gemini_raw and gemini_breiman: although gemini_raw got a slightly better result, its level of effort was significantly larger than gemini_breiman’s, whose final f1_score was quite close despite a very parsimonious script.

It’s always good to notice when over-optimizing within a certain class of models starts providing diminishing returns. This is a class context, and I capped how complicated the models you could use were allowed to be, so it was worth the effort to push RF/GB to the limit. But in a broader context, when performance starts saturating like this, it’s better either to call it a day if you’ve reached your goal or, more crucially, to seek a major conceptual change in what you are doing, like trying methods from a very different class or making a very big change to the model.

As we will see in the last lab, one thing that makes deep learning such a rich framework is that you can formulate very different model classes within the same framework by redesigning the network architecture, and that gives you a wide range of possible ways to improve your model.

Whereas RF/GB only has a handful of hparams, which usually give you at best marginal gains and saturate very quickly, neural networks have a magical hparam called “architecture” which allows for such distinct variations that they’ve been the single major paradigm in the field for more than 10 years. But their dominance has relied on re-inventing new and more powerful architectures every now and then.

That said, for certain types of data, like tabular data, RF/GB is taken to be SOTA, as you indirectly found out (no one in the top 5 ended up using a kernel method, as far as I remember). Some recent approaches such as TabPFN are slowly starting to break the dominance of GB for tabular data, but that’s a very recent thing.