Lab 10 - Classification Competition Round-Up

Author

Lucas Schwengber / Erez Buchweitz / Josh Davis

1 Load data

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
dir_path = r"../../datasets/diabetes/"
X = pd.read_parquet(f"{dir_path}X_train.parquet")
y = pd.read_parquet(f"{dir_path}y_train.parquet").squeeze()
X.shape, y.shape
((200000, 21), (200000,))
X_test = pd.read_parquet(f"{dir_path}X_test.parquet")

When you look at just these raw statistics from the training data, one thing should pop out to you immediately: you have massively more observations than features. This is the scenario where, if you have a prediction task, like we have here, you should throw your statistical hat out and put on an ML hat.

What is the key difference here? With this ratio of features to sample points, you should try to be as data-driven as possible. Instead of explicitly encoding strong “priors” (in an informal, not Bayesian, sense) about what the predictor should look like (e.g. linear, linear with a hand-crafted feature, and so on), since you have this much data you should use the most expressive models you can and let them figure out the right “priors”.

This works because you have much more data than features. Even if at the end of the day your model is merely a linear or quadratic function of the features, with this amount of data a very expressive model will be able to approximate that as accurately as the prediction task requires. The cost of forgoing manual feature engineering is not that significant, as we will see.

1.1 Checking features

{col: str(np.sort(X[col].unique()).tolist()) for col in X.columns}
{'HighBP': '[0, 1]',
 'HighChol': '[0, 1]',
 'CholCheck': '[0, 1]',
 'BMI': '[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 95, 96, 98]',
 'Smoker': '[0, 1]',
 'Stroke': '[0, 1]',
 'HeartDiseaseorAttack': '[0, 1]',
 'PhysActivity': '[0, 1]',
 'Fruits': '[0, 1]',
 'Veggies': '[0, 1]',
 'HvyAlcoholConsump': '[0, 1]',
 'AnyHealthcare': '[0, 1]',
 'NoDocbcCost': '[0, 1]',
 'GenHlth': '[1, 2, 3, 4, 5]',
 'MentHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
 'PhysHlth': '[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]',
 'DiffWalk': '[0, 1]',
 'Sex': '[0, 1]',
 'Age': '[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]',
 'Education': '[1, 2, 3, 4, 5, 6]',
 'Income': '[1, 2, 3, 4, 5, 6, 7, 8]'}
features_binary = [col for col in X.columns if X[col].nunique() == 2]
features_numeric = [col for col in X.columns if X[col].nunique() > 2]
features_numeric, features_binary
(['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income'],
 ['HighBP',
  'HighChol',
  'CholCheck',
  'Smoker',
  'Stroke',
  'HeartDiseaseorAttack',
  'PhysActivity',
  'Fruits',
  'Veggies',
  'HvyAlcoholConsump',
  'AnyHealthcare',
  'NoDocbcCost',
  'DiffWalk',
  'Sex'])

1.2 Class balance

np.mean(y)
0.139155
1.3 Check data types

X.dtypes
HighBP                  int64
HighChol                int64
CholCheck               int64
BMI                     int64
Smoker                  int64
Stroke                  int64
HeartDiseaseorAttack    int64
PhysActivity            int64
Fruits                  int64
Veggies                 int64
HvyAlcoholConsump       int64
AnyHealthcare           int64
NoDocbcCost             int64
GenHlth                 int64
MentHlth                int64
PhysHlth                int64
DiffWalk                int64
Sex                     int64
Age                     int64
Education               int64
Income                  int64
dtype: object

Important pre-processing step: some columns are categorical and some are numeric. We need to convert the categorical ones first.

ordinal_categorical_cols = ["GenHlth", "Education", "Income"]
nominal_categorical_cols = features_binary
categorical_cols = nominal_categorical_cols + ordinal_categorical_cols

# Convert in X
for col in nominal_categorical_cols:
    X[col] = X[col].astype("category")

for col in ordinal_categorical_cols:
    X[col] = pd.Categorical(
        X[col],
        categories=sorted(X[col].dropna().unique()),
        ordered=True
    )

# Keep X_test aligned (recommended)
for col in nominal_categorical_cols:
    X_test[col] = pd.Categorical(X_test[col], categories=X[col].cat.categories)

for col in ordinal_categorical_cols:
    X_test[col] = pd.Categorical(
        X_test[col],
        categories=X[col].cat.categories,
        ordered=True
    )

# Quick check
X.dtypes
HighBP                  category
HighChol                category
CholCheck               category
BMI                        int64
Smoker                  category
Stroke                  category
HeartDiseaseorAttack    category
PhysActivity            category
Fruits                  category
Veggies                 category
HvyAlcoholConsump       category
AnyHealthcare           category
NoDocbcCost             category
GenHlth                 category
MentHlth                   int64
PhysHlth                   int64
DiffWalk                category
Sex                     category
Age                        int64
Education               category
Income                  category
dtype: object

Here a crucial decision was made, and it is a place where stats and ML differ significantly. Note that GenHlth, Education and Income could in principle be treated as either ordinal or numeric. I explicitly decided to encode them as categorical. Why? Because categorical is the least informative way of encoding these variables. If you encode them as numeric or ordinal, you are implicitly assuming some form of spatial dependency between the levels. In a small sample/features regime, adding this assumption explicitly would be important to compensate for your lack of samples. Here, however, since we have so many samples, it is better to encode the variables in the most agnostic way possible: we assume no relation between the different categories, and instead allow the algorithm to leverage the large sample to figure out any relations that are actually there. In some sense, encoding too many prior assumptions when you have this much data can actually work against you, as we will see shortly.

Think about LLMs, for example. At the very base level, you feed them tokens that are essentially one-hot encoded, or else encoded by a pre-trained model that was itself initially trained on one-hot encoded tokens. Think about this for a moment: as humans we have plenty of prior knowledge about which tokens should be similar to one another, and yet when training these models we completely forget about it and feed the tokens one-hot encoded, as if each were completely unrelated to the others. The trick is that the large sample takes care of figuring out the relations needed for the task, more effectively than encoding them a priori.
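To make the distinction concrete, here is a minimal sketch on a toy column (not the lab's pipeline): encoded as numeric, the model sees one column whose spacing asserts that levels 1 and 5 are "far apart"; one-hot encoded, each level is a separate indicator with no assumed geometry.

```python
import pandas as pd

# toy ordinal-looking column, analogous to GenHlth
s = pd.Series([1, 3, 5, 3], name="GenHlth")

# numeric encoding: a single column; distances between levels are assumed
numeric = s.to_frame()

# categorical / one-hot encoding: one indicator per level, no assumed geometry
onehot = pd.get_dummies(s.astype("category"), prefix="GenHlth")
print(onehot.columns.tolist())
```

With enough data, the model can recover any ordering or spacing from the indicators on its own.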

Question: why didn’t I do the same thing for the other numerical variables? And more broadly, why shouldn’t I do that for an arbitrary numerical variable?

features_cat = [col for col in X.columns if str(X[col].dtype) == "category"]
features_numeric = [col for col in X.columns if str(X[col].dtype) == "int64"]
features_cat, features_numeric
(['HighBP',
  'HighChol',
  'CholCheck',
  'Smoker',
  'Stroke',
  'HeartDiseaseorAttack',
  'PhysActivity',
  'Fruits',
  'Veggies',
  'HvyAlcoholConsump',
  'AnyHealthcare',
  'NoDocbcCost',
  'GenHlth',
  'DiffWalk',
  'Sex',
  'Education',
  'Income'],
 ['BMI', 'MentHlth', 'PhysHlth', 'Age'])
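An equivalent, slightly more idiomatic way to produce this split is pandas' `select_dtypes`; a sketch on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "HighBP": pd.Categorical([0, 1, 0]),
    "BMI": [22, 31, 27],
})

# split columns by dtype instead of inspecting each one by name
cat_cols = df.select_dtypes("category").columns.tolist()
num_cols = df.select_dtypes("number").columns.tolist()
print(cat_cols, num_cols)
```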

2 Model training and evaluation scaffolding

def make_cv_pairs(n_folds, seed, X=X, y=y):
    fold = np.resize(np.arange(n_folds), X.shape[0])
    rng = np.random.default_rng(seed)
    rng.shuffle(fold)

    cv_pairs = [{
        "X_train": X.loc[fold != i, :],
        "y_train": y.loc[fold != i],
        "X_valid": X.loc[fold == i, :],
        "y_valid": y.loc[fold == i],
    } for i in np.unique(fold)]

    return cv_pairs

# Example usage
nfolds = 5
cv_pairs = make_cv_pairs(n_folds=nfolds, seed=42)
[(entry["X_train"].shape[0], entry["X_valid"].shape[0]) for entry in cv_pairs]
[(160000, 40000),
 (160000, 40000),
 (160000, 40000),
 (160000, 40000),
 (160000, 40000)]
from sklearn.metrics import f1_score, precision_recall_curve

def fit_predict_score(X_train, y_train, X_valid, y_valid, model, in_sample=False):
    model.fit(X_train, y_train)
    thresh = find_best_threshold(y_train, predict(model, X_train))
    if in_sample:
        pred = predict(model, X_train) > thresh
        return f1_score(y_train, pred)
    else:
        pred = predict(model, X_valid) > thresh
        return f1_score(y_valid, pred)
    

def predict(model, X):
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:,1]
    elif hasattr(model, "decision_function"):
        return model.decision_function(X)
    else:
        return model.predict(X)

def find_best_threshold(y, pred):
    y_arr = np.asarray(y)
    pred_arr = np.asarray(pred)

    valid_mask = np.isfinite(y_arr) & np.isfinite(pred_arr)
    if not np.any(valid_mask):
        return 0.5

    precision, recall, thresholds = precision_recall_curve(y_arr[valid_mask], pred_arr[valid_mask])

    # align with thresholds (precision/recall are one element longer)
    f1 = 2 * precision[:-1] * recall[:-1] / np.where(
        precision[:-1] + recall[:-1] == 0, 1, precision[:-1] + recall[:-1]
    )

    finite_f1 = np.isfinite(f1)
    if not np.any(finite_f1):
        return 0.5

    valid_idx = np.where(finite_f1)[0]
    best_idx = valid_idx[np.argmax(f1[finite_f1])]
    return thresholds[best_idx]

def cv_workflow(model, seed=42, nfolds=5, **kwargs):
    cv_pairs = make_cv_pairs(n_folds=nfolds, seed=seed)
    if nfolds == 2:
        # with 2 folds, score a single 50/50 split for speed
        scores = [fit_predict_score(model=model, **cv_pairs[0], **kwargs)]
    else:
        scores = [fit_predict_score(model=model, **entry, **kwargs) for entry in cv_pairs]
    return np.mean(scores)
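The threshold search inside find_best_threshold can be seen on a toy example (a self-contained sketch; note that precision_recall_curve treats its thresholds as inclusive, i.e. a point counts as positive when its score is >= the threshold):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_toy = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

precision, recall, thresholds = precision_recall_curve(y_toy, scores)
# precision/recall carry one extra trailing element; drop it to align with thresholds
f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
best_thresh = thresholds[np.argmax(f1)]
print(best_thresh, f1.max())
```

Here the best threshold is 0.35: scoring every point with score >= 0.35 as positive yields precision 0.75 and recall 1.0, for an F1 of 6/7.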

2.1 Simple training pipeline

from sklearn.preprocessing import SplineTransformer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from datetime import datetime
from sklearn.preprocessing import OneHotEncoder
predictors = {
    "logistic": LogisticRegression(C=1),
    "lda": LinearDiscriminantAnalysis(shrinkage="auto", solver="lsqr"),
    "svm": LinearSVC(C=0.01, dual="auto")
}

def build_model(predictor):
    transformer_numeric = Pipeline(steps=[
        ('spline', SplineTransformer(degree=3, n_knots=20)),
        ('scaler', StandardScaler())
    ])
    transformer_categorical = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='if_binary'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('numeric', transformer_numeric, features_numeric),
            ('categorical', transformer_categorical, features_cat)
        ]
    )

    model = Pipeline([
        ("preprocessor", preprocessor),
        ("predictor", predictor),
    ])

    return model

print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] start")
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:42:10] start
[2026-04-13 09:42:15] logistic=0.4706737896120605
[2026-04-13 09:42:19] lda=0.4628085249768343
[2026-04-13 09:42:34] svm=0.4705021544131595
[2026-04-13 09:42:34] end

The results for the same training run, if we encode only the binary variables as categorical and the rest as numeric (keeping the splines), are the following:

[2026-04-11 13:52:40] logistic=0.4611625371039416
[2026-04-11 13:52:43] lda=0.45808937405668904
[2026-04-11 13:52:47] svm=0.4554535702471137

Note how the more expressive models, like SVM and logistic regression, gain significantly from this elementary change.

This raw logistic model, with no hparam tuning, already placed me in the top 30.

predictor = LogisticRegression(C=1)

model = build_model(predictor)
model.fit(X, y)
thresh = find_best_threshold(y, predict(model, X))
y_pred = predict(model, X_test) > thresh
np.save("lucas_baseline_pred.npy", y_pred)

Following up on our conceptual intuition, I will completely ignore LDA and QDA because both models are less expressive and would be overly reliant on manually adding new features.

To start our evaluation of the logistic regression model we will see whether it is overfitting or not, to see whether we are lacking expressivity or regularization.

predictor = LogisticRegression(C=100000)
model = build_model(predictor)

train_score = cv_workflow(model, seed=10, in_sample=True)
print(f"train f1 score: {train_score}")

test_error = cv_workflow(model, seed=10, in_sample=False)
print(f"test f1 score: {test_error}")
train f1 score: 0.47144534867743404
test f1 score: 0.46981010798112

Not much overfitting going on, which means the logistic regression is not expressive enough as it is. We will tune its hparams anyway, just to see if there are any meaningful gains. Since the regularization is already weak, the results above suggest we are pretty close to the best we can get for this combination of logistic + features.

Let us check the sampling variability of the CV error by re-running cross-validation with different fold seeds, to get a sense of what counts as a meaningful change in the CV score. We will assume this variability does not change much across models.

from tqdm import tqdm

scores = []
predictor = LogisticRegression(C=1, max_iter=100)
model = build_model(predictor)
for seed in tqdm(range(0, 1000, 100)):
    test_error = cv_workflow(model, seed=seed, in_sample=False)
    scores.append(test_error)

print(f"Std test error: {np.std(scores)}")
100%|██████████| 10/10 [07:49<00:00, 46.96s/it]
Std test error: 0.0003585941661384227

3 Tune hparams for logistic

I will only try to tune regularization strength and class weights.

Finding good scales for regularization:

grid_C = [1e-1, 1, 1e1, 1e3, 1e6]

predictors = {
    C: LogisticRegression(C=C)
    for C in grid_C
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:49:01] 0.1=0.47155742969644504
[2026-04-13 09:49:06] 1=0.4716089205272232
[2026-04-13 09:49:13] 10.0=0.4714083372677535
[2026-04-13 09:49:18] 1000.0=0.47155917414621945
[2026-04-13 09:49:25] 1000000.0=0.47155917414621945
[2026-04-13 09:49:25] end
# formula from sklearn for class weight
base_class_weight = 1/(2*np.mean(y)) 

print(base_class_weight)

class_weight_grid = [0.75, 1, 3.6, 10]
3.5931155905285475
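The formula above is sklearn's "balanced" heuristic, n_samples / (n_classes * n_in_class); for the positive class of a binary problem this reduces to 1/(2*mean(y)). A quick self-contained check on toy labels (the 14% positive rate mirrors the class balance above):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels with 14% positives
y_toy = np.array([1] * 14 + [0] * 86)

w = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_toy)
# balanced weight for class 1 equals 1 / (2 * mean(y))
print(w[1], 1 / (2 * y_toy.mean()))
```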
predictors = {
    cw: LogisticRegression(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
    for cw in class_weight_grid
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:50:51] 0.75=0.471191066306497
[2026-04-13 09:50:57] 1=0.4716089205272232
[2026-04-13 09:51:01] 3.6=0.4711612126518551
[2026-04-13 09:51:07] 10=0.46974615507423767
[2026-04-13 09:51:07] end
from itertools import product

updated_C_grid = [1e-1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]

grid = list(product(updated_C_grid, updated_weight_grid))

predictors = {
    f"(C={C}, cw={cw})": LogisticRegression(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
    for C, cw in grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 09:56:30] (C=0.1, cw=0.75)=0.4701561141414578
[2026-04-13 09:56:50] (C=0.1, cw=1)=0.4700004190191736
[2026-04-13 09:57:10] (C=0.1, cw=3.6)=0.4696269553925762
[2026-04-13 09:57:32] (C=1, cw=0.75)=0.47006408953876333
[2026-04-13 09:57:55] (C=1, cw=1)=0.46993197092635947
[2026-04-13 09:58:18] (C=1, cw=3.6)=0.4696297170608033
[2026-04-13 09:58:47] (C=10, cw=0.75)=0.4699358590786124
[2026-04-13 09:59:17] (C=10, cw=1)=0.46949377588414504
[2026-04-13 09:59:43] (C=10, cw=3.6)=0.4696041031786275
[2026-04-13 09:59:43] end
logistic_hparams = {
    "C": 1,
    "class_weight": {1: 0.75, 0: 1/(2-1/0.75)}
}

SVM:

grid_C = [1e-3, 1e-1, 1, 1e1, 1e3, 1e6]

predictors = {
    C: LinearSVC(C=C, dual="auto")
    for C in grid_C
}
for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:02:51] 0.001=0.46638710690409685
[2026-04-13 10:03:38] 0.1=0.4672905743090041
[2026-04-13 10:04:28] 1=0.4674689112763988
[2026-04-13 10:05:27] 10.0=0.46750884270792076
[2026-04-13 10:06:25] 1000.0=0.46745973596271434
[2026-04-13 10:07:22] 1000000.0=0.46745973596271434
[2026-04-13 10:07:22] end
predictors = {
    cw: LinearSVC(C=1, class_weight={1:cw, 0:1/(2-1/cw)})
    for cw in class_weight_grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:09:55] 0.75=0.4666714366569618
[2026-04-13 10:10:44] 1=0.4674689112763988
[2026-04-13 10:12:16] 3.6=0.4698845813182954
[2026-04-13 10:13:19] 10=0.4672258720079015
[2026-04-13 10:13:19] end
updated_C_grid = [0.1, 1, 10]
updated_weight_grid = [0.75, 1, 3.6]

grid = list(product(updated_C_grid, updated_weight_grid))

predictors = {
    f"(C={C}, cw={cw})": LinearSVC(C=C, class_weight={1:cw, 0:1/(2-1/cw)})
    for C, cw in grid
}

for nm, predictor in predictors.items():
    model = build_model(predictor)
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] {nm}={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
[2026-04-13 10:16:02] (C=0.1, cw=0.75)=0.46818788894887897
[2026-04-13 10:16:21] (C=0.1, cw=1)=0.4696046478673657
[2026-04-13 10:16:49] (C=0.1, cw=3.6)=0.47180109157064887
[2026-04-13 10:17:04] (C=1, cw=0.75)=0.4682738230611781
[2026-04-13 10:17:34] (C=1, cw=1)=0.46964503243717953
[2026-04-13 10:18:21] (C=1, cw=3.6)=0.4713025798829574
[2026-04-13 10:18:39] (C=10, cw=0.75)=0.4682605111294312
[2026-04-13 10:19:22] (C=10, cw=1)=0.46948197241340245
[2026-04-13 10:25:44] (C=10, cw=3.6)=0.47134772814680514
[2026-04-13 10:25:44] end
svc_hparams = {
    "C": 0.1,
    "class_weight": {1: 3.6, 0: 1/(2-1/3.6)}
}

Let us check overfitting for the best models so far, to see whether we need more expressivity.

logistic = LogisticRegression(**logistic_hparams)
model = build_model(logistic)

train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 logistic: {train_score}")
test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 logistic: {test_error}")

svc = LinearSVC(**svc_hparams)
model = build_model(svc)

train_score = cv_workflow(model, nfolds=3, seed=10, in_sample=True)
print(f"train f1 svc: {train_score}")

test_error = cv_workflow(model, nfolds=3, seed=10, in_sample=False)
print(f"test f1 svc: {test_error}")
train f1 logistic: 0.47138173560527696
test f1 logistic: 0.47006408953876333
train f1 svc: 0.47117195020562536
test f1 svc: 0.46997019485179575

We have very little overfitting. This suggests the methods could use more expressivity. One approach we could take now would be to manually add more features, as some of you did. That is a reasonable approach and it led to some of the best scores. However, I opt for a more parsimonious approach, in light of what I elaborated on earlier: I believe our feature representation is already expressive enough. What we need is a method that can natively explore more non-linear functions of those features. That suggests trying boosting.

4 Gradient boosting

Sklearn has two implementations of gradient boosting. I am picking the more efficient one (HistGradientBoostingClassifier) so I can explore more hparams quickly.

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier()

score = cv_workflow(model, nfolds=2)
/home/lrs/estudos/berkeley/stat154_spring_2026/spring-2026-private/stat154/lib/python3.12/site-packages/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py:400: FutureWarning: The categorical_features parameter will change to 'from_dtype' in v1.6. The 'from_dtype' option automatically treats categorical dtypes in a DataFrame as categorical features.
  warnings.warn(
l2_grid = [0, 1, 10]
lr_grid = [0.01, 0.1, 1]
max_leaves_grid = [30, 100]
class_weight_grid = [0.75, 1, 2, 3.6]

full_grid = list(product(l2_grid, lr_grid, max_leaves_grid, class_weight_grid))
results = {}
for l2, lr, max_leaves, cw in full_grid:
    model = HistGradientBoostingClassifier(
        l2_regularization=l2,
        learning_rate=lr,
        max_leaf_nodes=max_leaves,
        class_weight={1:cw, 0:1/(2-1/cw)}
    )
    score = cv_workflow(model, nfolds=2, seed=10, in_sample=False)
    results[(l2, lr, max_leaves, cw)] = score
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] (l2={l2}, lr={lr}, max_leaves={max_leaves}, cw={cw})={score}")
print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] end")
sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)
sorted_results
[((1, 0.1, 30, 1), 0.47397189708954585),
 ((1, 0.1, 30, 3.6), 0.4739598881862838),
 ((1, 0.1, 30, 0.75), 0.47355342563010294),
 ((10, 0.1, 30, 1), 0.4734924659730116),
 ((10, 0.1, 30, 3.6), 0.4731454213995232),
 ((0, 0.1, 30, 3.6), 0.47312541473125413),
 ((10, 0.1, 100, 2), 0.4727006428164583),
 ((0, 0.1, 30, 2), 0.4725328147788041),
 ((0, 0.1, 100, 3.6), 0.47230571060541005),
 ((10, 0.1, 30, 2), 0.4721347790602775),
 ((1, 0.1, 30, 2), 0.4718932443703086),
 ((1, 0.1, 100, 3.6), 0.47136248485516025),
 ((0, 0.1, 30, 1), 0.4712298078056833),
 ((1, 0.1, 100, 2), 0.47118556553168295),
 ((10, 0.1, 100, 3.6), 0.47114688594840315),
 ((1, 0.01, 100, 0.75), 0.4706592332892695),
 ((0, 0.01, 100, 1), 0.47044522968197877),
 ((10, 0.1, 30, 0.75), 0.47032138333284235),
 ((10, 0.1, 100, 0.75), 0.47022146407458604),
 ((0, 0.1, 30, 0.75), 0.47020167003285807),
 ((0, 0.1, 100, 2), 0.4700665188470067),
 ((1, 0.01, 100, 1), 0.4699048462049126),
 ((0, 0.01, 100, 2), 0.46963495575221237),
 ((1, 0.1, 100, 1), 0.4695717074290963),
 ((10, 0.01, 100, 1), 0.4693586968903064),
 ((1, 0.1, 100, 0.75), 0.4692389758179232),
 ((10, 0.01, 100, 0.75), 0.4691295970511854),
 ((10, 0.01, 100, 2), 0.4691264857289168),
 ((1, 0.01, 100, 2), 0.46912533559417835),
 ((0, 0.01, 100, 0.75), 0.46909183413889594),
 ((0, 0.1, 100, 1), 0.46778086712090194),
 ((0, 0.01, 100, 3.6), 0.46741679622651455),
 ((1, 0.01, 100, 3.6), 0.4673497231213718),
 ((10, 0.01, 100, 3.6), 0.4670399133104569),
 ((1, 0.01, 30, 0.75), 0.46703837323712694),
 ((10, 0.01, 30, 1), 0.4665549615681268),
 ((10, 0.01, 30, 0.75), 0.4657234339509045),
 ((1, 0.01, 30, 1), 0.46565627680869315),
 ((0, 0.01, 30, 0.75), 0.46562022503125433),
 ((10, 0.1, 100, 1), 0.4653063752051274),
 ((1, 0.01, 30, 2), 0.46506200645844076),
 ((0, 0.01, 30, 1), 0.4645537497975271),
 ((0, 0.01, 30, 2), 0.4643511943189154),
 ((10, 0.01, 30, 2), 0.46387301915570145),
 ((0, 0.1, 100, 0.75), 0.4635676861216035),
 ((1, 0.01, 30, 3.6), 0.46193345559339793),
 ((0, 0.01, 30, 3.6), 0.4619190209608124),
 ((10, 0.01, 30, 3.6), 0.4615306433580648),
 ((10, 1, 30, 2), 0.45838148659695016),
 ((10, 1, 30, 3.6), 0.4577893354065575),
 ((1, 1, 30, 2), 0.4569364312074432),
 ((10, 1, 30, 0.75), 0.4559430071241095),
 ((1, 1, 30, 3.6), 0.45558448677538715),
 ((10, 1, 30, 1), 0.4555018972059331),
 ((0, 1, 30, 1), 0.4540675302151645),
 ((0, 1, 30, 3.6), 0.45280931433237626),
 ((1, 1, 30, 1), 0.4518628110289173),
 ((0, 1, 30, 2), 0.4518586628136083),
 ((10, 1, 100, 3.6), 0.4484416539397025),
 ((10, 1, 100, 2), 0.44667545360217614),
 ((0, 1, 100, 2), 0.44229451585496204),
 ((0, 1, 100, 3.6), 0.4418382806800965),
 ((1, 1, 100, 3.6), 0.4410955171004224),
 ((1, 1, 30, 0.75), 0.44021363145806236),
 ((0, 1, 30, 0.75), 0.4394418604651163),
 ((10, 1, 100, 0.75), 0.43937093934057925),
 ((10, 1, 100, 1), 0.43920722017330815),
 ((0, 1, 100, 1), 0.4386407558514065),
 ((1, 1, 100, 2), 0.4374098228198765),
 ((0, 1, 100, 0.75), 0.4219705041215724),
 ((1, 1, 100, 0.75), 0.4197631686402613),
 ((1, 1, 100, 1), 0.41947907826881653)]

Recompute the error estimates for the top candidates to get more reliable results:

top_10 = [{"l2": key[0], "lr": key[1], "max_leaves": key[2], "cw": key[3]} for key, _ in sorted_results[:10]]

for param in top_10:
    model = HistGradientBoostingClassifier(
        l2_regularization=param["l2"],
        learning_rate=param["lr"],
        max_leaf_nodes=param["max_leaves"],
        class_weight={1: param["cw"], 0: 1/(2 - 1/param["cw"])}
    )
    test_error = cv_workflow(model, nfolds=10, seed=10, in_sample=False)
    print(f"params: {param}")
    print(f"test f1 grad boost: {test_error}")

The best hparams from this second more accurate run were “l2”: 1, “lr”: 0.1, “max_leaves”: 30, “cw”: 3.6. Let us do a final overfitting sanity check.

grad_boost_hparams = {"l2": 1, "lr": 0.1, "max_leaves": 30, "cw": 3.6}

model = HistGradientBoostingClassifier(
    l2_regularization=grad_boost_hparams["l2"],
    learning_rate=grad_boost_hparams["lr"],
    max_leaf_nodes=grad_boost_hparams["max_leaves"],
    class_weight={1:grad_boost_hparams["cw"], 0:1/(2-1/grad_boost_hparams["cw"])}
)

train_score = cv_workflow(model, nfolds=10, seed=5, in_sample=True)
print(f"train f1 grad boost: {train_score}")

test_error = cv_workflow(model, nfolds=10, seed=5, in_sample=False)
print(f"test f1 grad boost: {test_error}")
train f1 grad boost: 0.4836779370286172
  warnings.warn(
test f1 grad boost: 0.4721004154120754
model.fit(X, y)                                      # refit on the full training set
thresh = find_best_threshold(y, predict(model, X))   # tune the decision threshold for F1
y_pred = predict(model, X_test) > thresh             # binarize test predictions
np.save("lucas_final_pred.npy", y_pred)
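The `find_best_threshold` helper used above is defined earlier in the lab. As a reminder of the idea, here is a minimal sketch of what such a helper might look like (the actual implementation may differ): sweep a grid of candidate thresholds and keep the one that maximizes F1 on the training predictions.

```python
import numpy as np
from sklearn.metrics import f1_score

def find_best_threshold(y_true, scores, n_grid=101):
    """Grid-search a probability threshold that maximizes F1 on (y_true, scores)."""
    thresholds = np.linspace(0.0, 1.0, n_grid)
    f1s = [f1_score(y_true, (scores > t).astype(int), zero_division=0) for t in thresholds]
    return thresholds[int(np.argmax(f1s))]

# Toy check: scores cleanly separate the two classes.
y_toy = np.array([0, 0, 1, 1])
s_toy = np.array([0.1, 0.2, 0.8, 0.9])
t = find_best_threshold(y_toy, s_toy)
```

Note this tunes the threshold on the same data the model was fit on, which is fine here only because the threshold is a single degree of freedom.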

We are indeed overfitting. If we wanted an even better score, we could now try to reduce overfitting by thinking about principled ways of increasing regularization. But in principle our hparam sweep over the boosting parameters already did that, and larger regularization did not seem to improve the final results. What does seem clear is that our exploration of boosting is giving us diminishing returns at this point. There are still tweaks we could try within the boosting world, as it is the most advanced model we can use here, but we opt to stop, on the view that further exploration of boosting would not yield results significantly better than what we already have.

5 Leaderboard

group  f1 score  median rank  Q1 rank  Q3 rank  precision  recall
gbjj2 0.4675 5 2 9 0.3892 0.5854
brand_adc 0.4674 6 3 9 0.3816 0.6029
bill&richard 0.4673 6 3 10 0.3762 0.6168
Deni2 0.4673 7 3 11 0.3824 0.6004
hello 0.4671 8 4 12 0.3809 0.6036
gemini_raw 0.4666 12 7 16 0.3811 0.6017
sourdough3 0.4663 13 9 17 0.3794 0.6048
什么时候能更新完讲义 0.4662 14 9 18 0.3893 0.5811
epicfire77day2 0.4661 14 9 18 0.3843 0.5921
MJ2 0.4661 14 9 19 0.3866 0.5867
lucas_final_pred 0.4660 15 11 18 0.3814 0.5989
Dragon_Warrior_v2 0.4657 16 12 20 0.3797 0.6023
lab9ez 0.4657 17 12 21 0.3681 0.6337
gemini_breiman 0.4653 19 15 21 0.3778 0.6057
8279.2 0.4652 20 16 22 0.3786 0.6032
567 0.4649 20 17 23 0.3772 0.6056
aaa123___ 0.4648 21 18 24 0.3811 0.5956
kdot 0.4643 23 19 25 0.3721 0.6174
erez 0.4641 23 20 26 0.3754 0.6077
Darkeee 0.4639 24 20 26 0.3829 0.5886
apeman27-2 0.4629 26 24 29 0.3709 0.6156
zhej2 0.4624 27 26 29 0.3762 0.6000
Rith_God (1) 0.4620 28 26 30 0.3799 0.5896
lucas_baseline_pred 0.4616 29 27 31 0.3746 0.6013
lemon 0.4613 30 27 32 0.3800 0.5870
concurrentenrollment 0.4608 31 29 32 0.3743 0.5995
yay2 0.4593 34 32 36 0.3733 0.5969
T 0.4592 34 32 36 0.3753 0.5912
pollen (2) 0.4590 34 32 36 0.3737 0.5948
Reda 0.4590 35 32 37 0.3652 0.6174
Keval_Lab8 0.4585 36 33 38 0.3490 0.6683
lovestat 0.4580 37 34 39 0.3723 0.5951
purpleapple 0.4580 36 34 39 0.3813 0.5733
1001 0.4566 39 37 42 0.3522 0.6492
delicious_coffee 0.4561 40 38 46 0.3589 0.6255
kaihong ren 1 0.4557 41 40 45 0.3629 0.6122
ardak 0.4549 44 41 48 0.3592 0.6204
willster 0.4548 44 42 48 0.3554 0.6314
han2159 0.4547 45 42 49 0.3521 0.6419
letsTryIt 0.4545 46 43 51 0.3532 0.6374
YNWA 0.4542 49 45 53 0.3596 0.6164
Cam 0.4541 50 45 54 0.3554 0.6287
bears 0.4541 50 46 55 0.3554 0.6286
154_Lab_7_Submission 0.4541 50 47 54 0.3602 0.6141
Classicals 0.4538 52 47 58 0.3505 0.6435
Stat154 too hard 0.4537 53 48 56 0.3620 0.6079
154lab111 0.4536 53 49 57 0.3627 0.6055
sg1 0.4536 54 49 59 0.3503 0.6434
my_name (2) 0.4535 54 48 58 0.3666 0.5947
edwards 0.4535 55 51 59 0.3626 0.6053
dexhome 0.4534 56 52 59 0.3625 0.6053
Sid 0.4533 56 50 60 0.3687 0.5883
plsfix 0.4532 57 50 61 0.3688 0.5878
Pengbo 0.4532 57 52 61 0.3680 0.5898
msstumble 0.4531 57 46 63 0.3481 0.6490
pearl 0.4531 58 54 61 0.3590 0.6141
Beta 0.4530 58 51 62 0.3742 0.5738
OJP_Final 0.4528 60 54 63 0.3468 0.6522
arya_deshmukh 0.4525 61 53 64 0.3536 0.6281
corn 0.4494 65 65 65 0.3597 0.5985
anon_stat234 0.4471 66 66 67 0.3172 0.7569
LanaDelRey 0.4469 66 66 67 0.3212 0.7345
wupenghao2 0.4401 68 68 69 0.3055 0.7863
vivreplm 0.4394 69 69 70 0.3054 0.7828
Pavel_lab9 0.4394 69 69 70 0.3100 0.7538
midterm_good_luck 0.4379 71 71 71 0.3073 0.7618
jiver2 0.4336 72 72 72 0.3102 0.7203
pd 0.4300 73 73 73 0.2910 0.8234
taeyoungkim 0.4247 74 74 75 0.4221 0.4273
cvvc_121 0.4239 75 74 75 0.4511 0.3997
Radon-Nikodym-2 0.4114 76 76 76 0.3666 0.4687
dp 0.3979 78 77 78 0.3133 0.5453
o2-4 0.3979 78 77 79 0.3123 0.5480
cpv 0.3977 78 78 79 0.3124 0.5472
advanced_v2 0.3288 80 80 80 0.5020 0.2444
pg 0.2855 81 81 81 0.4939 0.2008
hamster 0.2764 82 82 82 0.5475 0.1848
lost_and_confused2 0.2466 83 83 83 0.5192 0.1617
Prediction SCDP 0.2278 84 84 84 0.5250 0.1454
silly 0.1929 86 86 86 0.5551 0.1167
tent2 0.1591 87 87 87 0.5740 0.0923
Cole 0.1366 88 88 88 0.4472 0.0806
elle 0.1155 89 89 90 0.5793 0.0641
billhewang 0.1153 90 89 90 0.5718 0.0641
pipis 0.0000 91 91 91 0.0000 0.0000

6 What LLMs did

Gemini raw built a somewhat convoluted pipeline, but its final prediction was just gradient boosting with the default parameter choices from the sklearn package. Nonetheless, it did one thing significantly better than I did: it explored more methods instead of spending too much time tuning the hparams of a single one. This allowed it to find out that gradient boosting was a top performer even with its default hparams.

Gemini Breiman went straight for random forests and then HistGradientBoosting, like I did.

The rest was pretty much standard, like everyone else: CV for error estimation. But the LLMs were quite lazy with hparam tuning. As the empirical results show, that was mostly fine: the biggest differences came from the methods used, not from the hparam tuning itself.

7 The tied top 5 from the class

  • gbjj2 did random forests and gradient boosting with a wider hparam sweep than I did. They also added some manually engineered features, including interactions. There was a final step of averaging the predictions of RF and GB, which I believe might have made the predictions worse, but the final score was high anyway, which means at least one of the two averaged models was individually very good.

  • Deni2 used LightGBM but with much more background data processing and manual inspection. They manually added several interactions which they found by exploring the data.

  • bill and richard did some heavy manual feature engineering too and were more careful in choosing which hparams to test; they also worked with HistGradientBoosting, gradient boosting, and random forests.

  • hello did something similar to the previous two: heavy manual feature addition plus a very large hparam sweep with LightGBM.

  • brand adc searched over a very wide class of models and a wide range of hparams. There was not that much manual engineering; it was more about scaling.

Overall, all top 5 submissions were very good. They put in a lot of work and it paid off. The first four leaned heavily on adding more features, especially interactions, greatly increasing the number of degrees of freedom to optimize over, even when they were not entirely sure whether an addition was worthwhile.

That’s the spirit. The more degrees of freedom you have to optimize over, and the more computational power you are willing to leverage to test things, the better the results will be.

I’d just like to point out, as usual, that the gains from saturating a narrow set of ideas can become marginal pretty quickly. Everyone at the top of the ranking essentially ended up using a variant of RF/GB with just some small tweaks, such as adding more features or doing larger hparam sweeps. And, as usual, the top of the ranking is quite bunched up.

If we compare gemini_raw and gemini_breiman: although gemini_raw got a slightly better result, its level of effort was significantly larger than gemini_breiman’s, whose final f1_score was quite close despite a very parsimonious script.

It’s always good to notice when over-optimizing within a certain class of models starts providing diminishing returns. This is a class context, and I capped how complicated the models you could use were allowed to be, so it was worth the effort to push RF/GB to the limit. But in a broader context, when performance starts saturating like this, it’s better either to call it a day if you’ve reached your goal or, more crucially, to seek a major conceptual change in what you are doing, like trying methods from a very different class or making a very big change to the model.

As we will see in the last lab, one thing that makes deep learning such a rich framework is that you can formulate very different model classes within the same framework by redesigning the network architecture, and that gives you a wide range of possible ways to improve your model.

Whereas RF/GB only has a handful of hparams, which usually give you at best marginal gains and saturate very quickly, neural networks have a magical hparam called “architecture” which allows for such distinct variations that they’ve been the single major paradigm in the field for more than 10 years. But their dominance has relied on re-inventing new and more powerful architectures every now and then.

That said, for certain types of data, like tabular data, RF/GB is taken to be SOTA, as you indirectly found out (no one in the top 5 ended up using a kernel method, as far as I remember). Some recent approaches such as TabPFN are slowly starting to break the dominance of GB for tabular data, but that’s a very recent thing.