# Uncomment and run to install all required packages
# !pip install torch torchvision scikit-learn plotly nbformat ipykernel

Lab 11 Tutorial - Python setup and quick demonstration of neural network training
Setup: Creating a Virtual Environment
We recommend working in a dedicated virtual environment to keep dependencies isolated. Pick one of the following options:
Option A — venv (built into Python): I used this option and got everything working smoothly :)
python -m venv stat254-env
source stat254-env/bin/activate # Mac/Linux
stat254-env\Scripts\activate # Windows

Option B — conda:
conda create -n stat254-env python=3.10
conda activate stat254-env

Once activated, proceed to the package installation below.
Setup: Installing PyTorch
Before running this notebook, make sure PyTorch and torchvision are installed. Find your platform below and run the corresponding command in your terminal (or uncomment it in the cell below).
Virtual environments
If you are using a virtual environment, activate it before installing. Also make sure Jupyter is using the correct kernel — in Jupyter: Kernel → Change Kernel and select your environment.
venv:
source your-env/bin/activate # Mac/Linux
your-env\Scripts\activate # Windows
pip install torch torchvision ipykernel
python -m ipykernel install --user --name=your-env

conda:
conda activate your-env
conda install pytorch torchvision ipykernel -c pytorch
python -m ipykernel install --user --name=your-env

Mac — Apple Silicon (M1/M2/M3)
pip install torch torchvision

PyTorch will automatically use the MPS backend for GPU acceleration. To enable it in code: device = "mps".
Mac — Intel
pip install torch torchvision

Intel Macs run CPU-only. Use device = "cpu" in the notebook.
Windows / Linux — CPU only
pip install torch torchvision

Windows / Linux — NVIDIA GPU (CUDA 12.1)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

For other CUDA versions, use the official selector: https://pytorch.org/get-started/locally/
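Rather than hard-coding the device strings mentioned above, you can detect the best available backend at runtime. A minimal sketch, assuming torch is already installed:

```python
import torch

# Pick the best available backend: CUDA (NVIDIA), then MPS (Apple Silicon),
# then plain CPU as a fallback.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

print(f"Using device: {device}")
```

The rest of this notebook runs comfortably on CPU, so the fallback costs you nothing here.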
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
torch.manual_seed(42)
np.random.seed(42)

1. A Non-Linearly Separable Dataset
We generate 1,000 points from two concentric circles: outer circle = class 0, inner circle = class 1 (following scikit-learn's make_circles convention).
The ground-truth decision boundary is circular. A point \((x_1, x_2)\) belongs to class 1 if \[x_1^2 + x_2^2 < \text{threshold.}\] Equivalently, this is a logistic regression on the single non-linear feature \(\varphi(x) = x_1^2 + x_2^2\) — but that feature is hidden from a model that only sees raw coordinates.
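As a quick sanity check of this rule: thresholding \(r^2\) alone already classifies the noisy circles almost perfectly. The threshold 0.49 below is a hand-picked guess (a radius of 0.7, roughly halfway between the two circle radii of 0.4 and 1.0):

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=42)
r2 = X[:, 0]**2 + X[:, 1]**2

# make_circles labels the inner circle 1 and the outer circle 0,
# so the rule is: r^2 < threshold  =>  class 1
y_pred = (r2 < 0.49).astype(int)
acc = (y_pred == y).mean()
print(f"Accuracy of the hand-thresholded rule: {acc:.2%}")
```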
X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=42)
fig, ax = plt.subplots(figsize=(5, 5))
sc = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
ax.set_title('Raw data — NOT linearly separable')
ax.set_xlabel('x₁')
ax.set_ylabel('x₂')
plt.colorbar(sc, ax=ax, label='class')
plt.tight_layout()
plt.show()
2. Logistic Regression: Raw Features vs. the Right Feature
A standard logistic regression on \((x_1, x_2)\) cannot separate the two circles — no hyperplane can.
But if we hand-engineer \(\varphi(x) = x_1^2 + x_2^2\), a logistic regression achieves near-perfect accuracy. The challenge: finding the right \(\varphi\) requires domain knowledge.
Neural networks sidestep this by learning \(\varphi\) from data.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr_raw = LogisticRegression()
lr_raw.fit(X_scaled, y)
acc_raw = lr_raw.score(X_scaled, y)
r_sq = (X[:, 0]**2 + X[:, 1]**2).reshape(-1, 1)
lr_oracle = LogisticRegression()
lr_oracle.fit(r_sq, y)
acc_oracle = lr_oracle.score(r_sq, y)
print(f'LR on raw (x₁, x₂): accuracy = {acc_raw:.2%}')
print(f'LR on φ(x) = r² (oracle): accuracy = {acc_oracle:.2%}')

LR on raw (x₁, x₂): accuracy = 50.20%
LR on φ(x) = r² (oracle): accuracy = 99.90%
3. Training a Neural Network
We train a small MLP on raw \((x_1, x_2)\). For this simple dataset a tiny network suffices: one hidden layer of 8 units, followed by a 16-dimensional embedding layer. After training, we will extract this embedding and visualize it.
import matplotlib.pyplot as plt

def draw_nn_architecture():
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.set_xlim(0, 10)
    ax.set_ylim(-1, 7)
    ax.axis('off')
    colors = {'input': '#AED6F1', 'hidden': '#D5D8DC',
              'embed': '#FAD7A0', 'output': '#A9DFBF'}
    # (x, n_neurons, type) — only 8 neurons are drawn for the embedding
    # layer for readability; the trained model uses a 16-D embedding
    layers = [
        (1.2, 2, 'input'),
        (3.8, 8, 'hidden'),
        (6.4, 8, 'embed'),
        (9.0, 2, 'output'),
    ]
    layer_labels = ['Input\n$\\mathbb{R}^{2}$',
                    'Hidden\n$\\mathbb{R}^{8}$',
                    'Embedding\n$\\mathbb{R}^{16}$',
                    'Output\n$\\mathbb{R}^{2}$']
    act_labels = ['Linear + ReLU', 'Linear + ReLU', 'Linear']
    r = 0.28

    def neuron_ys(n):
        spacing = 0.72
        return [3.0 + (i - (n - 1) / 2) * spacing for i in range(n)]

    layer_centers = [(x, neuron_ys(n), ltype) for x, n, ltype in layers]
    # Connections
    for i in range(len(layer_centers) - 1):
        x1, ys1, _ = layer_centers[i]
        x2, ys2, _ = layer_centers[i + 1]
        for y1 in ys1:
            for y2 in ys2:
                ax.plot([x1 + r, x2 - r], [y1, y2],
                        color='#CCCCCC', linewidth=0.55, zorder=1)
    # Neurons
    for (x, ys, ltype) in layer_centers:
        for y in ys:
            ax.add_patch(plt.Circle((x, y), r, color=colors[ltype],
                                    ec='#555555', linewidth=1.2, zorder=2))
    # Layer labels
    for idx, (x, ys, ltype) in enumerate(layer_centers):
        col = '#C0760A' if ltype == 'embed' else 'black'
        fw = 'bold' if ltype == 'embed' else 'normal'
        ax.text(x, -0.35, layer_labels[idx], ha='center', va='top',
                fontsize=9.5, fontweight=fw, color=col, multialignment='center')
    # Activation labels
    for i in range(len(layer_centers) - 1):
        x1, x2 = layer_centers[i][0], layer_centers[i + 1][0]
        ax.text((x1 + x2) / 2, 6.2, act_labels[i], ha='center',
                fontsize=8, color='#888888', multialignment='center')
    # embedding() annotation
    ex = layer_centers[2][0]
    eys = layer_centers[2][1]
    y_top, y_bot = max(eys) + r + 0.05, min(eys) - r - 0.05
    ax.annotate('', xy=(ex + r + 0.05, y_bot), xytext=(ex + r + 0.05, y_top),
                arrowprops=dict(arrowstyle='-', color='#C0760A', lw=1.8))
    ax.text(ex + r + 0.18, (y_top + y_bot) / 2, 'embedding()',
            ha='left', va='center', fontsize=9,
            color='#C0760A', fontstyle='italic', rotation=90)
    plt.tight_layout()
    plt.show()
draw_nn_architecture()
class MLP(nn.Module):
    def __init__(self, embed_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2, 8),
            nn.ReLU(),
            nn.Linear(8, embed_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(embed_dim, 2)

    def embedding(self, x):
        return self.encoder(x)

    def forward(self, x):
        return self.classifier(self.embedding(x))

X_tensor = torch.tensor(X_scaled, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)
model = MLP(embed_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
losses = []
for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_tensor), y_tensor)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

model.eval()
with torch.no_grad():
    acc_nn = (model(X_tensor).argmax(dim=1) == y_tensor).float().mean().item()
print(f'Neural network accuracy: {acc_nn:.2%}')
plt.figure(figsize=(6, 3))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Cross-entropy loss')
plt.title('Training loss')
plt.tight_layout()
plt.show()

Neural network accuracy: 100.00%

Understanding the Training Loss Plot
The plot shows cross-entropy loss (y-axis) vs. epoch (x-axis).
What is an epoch? One epoch = one full pass through all 1,000 training points. At each epoch, the optimizer sees every example, computes the loss, and updates the network weights via backpropagation. We run 500 epochs, meaning the network sees the full dataset 500 times.
This is in-sample error. The loss is computed on the exact same data used to train the network — it measures how well the model fits its training set, not how well it generalizes to unseen data. A steadily decreasing curve means the network is successfully learning the training data. The plateau toward the end indicates convergence: additional epochs would yield little further improvement.
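The in-sample point can be made concrete by holding out a test set and scoring on data the model never saw. A lightweight sketch using the same circles data and the hand-crafted \(r^2\) feature (the 80/20 split size is illustrative, not part of the tutorial's pipeline):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=42)
r2 = (X[:, 0]**2 + X[:, 1]**2).reshape(-1, 1)

# Train on 80% of the points, report accuracy on the held-out 20%
r2_tr, r2_te, y_tr, y_te = train_test_split(r2, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(r2_tr, y_tr)
print(f"In-sample accuracy:     {clf.score(r2_tr, y_tr):.2%}")
print(f"Out-of-sample accuracy: {clf.score(r2_te, y_te):.2%}")
```

For this easy problem the two numbers are nearly identical; on harder tasks the gap between them is the overfitting you cannot see in a training-loss curve.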
4. Visualizing the Learned Representation with t-SNE
We extract the 16-dimensional embeddings \(z = \text{encoder}(x)\) and project them to 2D with t-SNE.
If the network has learned good features, the two classes should form compact, well-separated clusters in the projected space — in stark contrast to the original input space.
model.eval()
with torch.no_grad():
raw_Z = model.forward(X_tensor).numpy() # shape: (1000, 2)
Z = model.embedding(X_tensor).numpy() # shape: (1000, 16)
H = model.encoder[:2](X_tensor).numpy() # shape: (1000, 8) — first hidden layer (Linear + ReLU)
tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)
Z_2d = tsne.fit_transform(Z)
tsne_h = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)
H_2d = tsne_h.fit_transform(H)
fig, axes = plt.subplots(1, 4, figsize=(20, 5))
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
axes[0].set_title('Input space\n(NOT linearly separable)')
axes[0].set_xlabel('x₁')
axes[0].set_ylabel('x₂')
axes[1].scatter(H_2d[:, 0], H_2d[:, 1], c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
axes[1].set_title('Hidden layer 1 (t-SNE)\n(8D → 2D)')
axes[1].set_xlabel('t-SNE dim 1')
axes[1].set_ylabel('t-SNE dim 2')
axes[2].scatter(Z_2d[:, 0], Z_2d[:, 1], c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
axes[2].set_title('Learned embedding (t-SNE)\n(linearly separable in HD!)')
axes[2].set_xlabel('t-SNE dim 1')
axes[2].set_ylabel('t-SNE dim 2')
axes[3].scatter(raw_Z[:, 0], raw_Z[:, 1], c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
axes[3].set_title('Raw output of the network\n(linearly separable in 1D)')
axes[3].set_xlabel('logit for class 0')
axes[3].set_ylabel('logit for class 1')
plt.tight_layout()
plt.show()
We can verify linear separability quantitatively: fit a logistic regression on the raw features vs. on the learned embedding.
lr_embed = LogisticRegression()
lr_embed.fit(Z, y)
acc_embed = lr_embed.score(Z, y)
print('Logistic regression accuracy:')
print(f' Raw (x₁, x₂): {acc_raw:.2%}')
print(f' φ(x) = r² (oracle): {acc_oracle:.2%}')
print(f' NN embedding: {acc_embed:.2%}')
print()
print('The NN discovered a representation as good as the hand-crafted oracle feature.')

Logistic regression accuracy:
Raw (x₁, x₂): 50.20%
φ(x) = r² (oracle): 99.90%
NN embedding: 100.00%
The NN discovered a representation as good as the hand-crafted oracle feature.
5. Comparison with the Ground-Truth Function Being Thresholded
from sklearn.decomposition import PCA
raw_Z_pc1 = PCA(n_components=1).fit_transform(raw_Z)
rs = X[:, 0]**2 + X[:, 1]**2
plt.figure(figsize=(6, 4))
plt.scatter(rs, raw_Z_pc1, c=y, cmap='bwr', alpha=0.6, edgecolors='k', linewidths=0.3)
plt.title('NN output vs. oracle feature φ(x) = r²')
plt.xlabel('r²')
plt.ylabel('NN final pre-activation (1D projection)')
plt.tight_layout()
plt.show()
The NN has effectively learned a monotonic (and slightly noisy) transformation of the function \(x_1^2 + x_2^2\). Although this is not the exact feature used to generate the data, and we could have figured out the right feature transformation by hand, for any realistic problem with a large number of features it is much more effective to let the NN learn the features by simply feeding it a lot of data.
You could have used a kernel SVM to get similarly good results and separability in high dimensions (for example, by picking an RBF kernel with a sufficiently small bandwidth). The main weakness of the kernel approach, however, is that it is highly sensitive to your kernel choice. If the function you are thresholding is highly non-smooth in your features (which it often is), any kernel you hand-pick will probably give sub-optimal results. A simple example is the next challenge involving images: an RBF kernel would make very little sense there, since two images can be far apart in Euclidean space yet have similar content, and very close in Euclidean space yet have completely different content. Putting in the effort to pick a good kernel is almost as cumbersome as manually designing the right features. The magic of neural networks is that, as long as you have enough data and a sufficiently expressive architecture, they learn the necessary features and embedding for you.
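To make the kernel-SVM comparison concrete, here is a sketch on the same circles data. The bandwidth gamma=2.0 is a hand-picked guess, which illustrates exactly the sensitivity discussed above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=1000, noise=0.1, factor=0.4, random_state=42)

# An RBF kernel implicitly maps points into a feature space where
# the two concentric circles become linearly separable
svm = SVC(kernel='rbf', gamma=2.0).fit(X, y)
print(f"RBF-SVM training accuracy: {svm.score(X, y):.2%}")
```

On this low-dimensional, smooth problem the hand-picked kernel works fine; the point of the surrounding discussion is that on images it would not.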
One could ask whether a kernel method in which the kernel is learned from the data, rather than hand-picked, could be as good as a neural network. Taking this idea seriously is behind some contemporary research trying to make sense of how neural networks work (see Recursive Feature Machines, RFM). However, it is still an open problem to show how exactly one can achieve results comparable to NNs using kernel methods, even with data-driven kernel selection. The default wisdom is that, as things stand, neural networks are the preferred method when you have a lot of data and a task with a well-defined (and differentiable) objective.
6. Connection to the FashionMNIST Lab
Everything above scales directly to the full challenge:
| Tutorial (toy) | Lab challenge (FashionMNIST) |
|---|---|
| 2D input, 2 classes | 28×28 images, 10 classes |
| Non-linear boundary (circles) | Complex visual boundaries (clothing categories) |
| MLP, 8D embedding | CNN, 3D embedding via model.embedding(x) |
Your task: design a CNN that produces a 3D embedding where the 10 clothing categories form well-separated Gaussian clusters — just like the two circles became separable above.
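As a starting point only (every layer size below is an illustrative guess, not the required architecture), a CNN for the lab can mirror the MLP's interface, exposing the same embedding() method:

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    """Sketch of a FashionMNIST classifier with a 3D embedding."""
    def __init__(self, embed_dim=3, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16x14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32x7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 64),
            nn.ReLU(),
            nn.Linear(64, embed_dim),                     # the 3D embedding
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def embedding(self, x):
        return self.encoder(x)

    def forward(self, x):
        return self.classifier(self.embedding(x))

model = CNN()
x = torch.randn(4, 1, 28, 28)   # dummy batch of 4 grayscale 28x28 images
print(model.embedding(x).shape)  # torch.Size([4, 3])
print(model(x).shape)            # torch.Size([4, 10])
```

Because the interface matches the tutorial's MLP, the same training loop and t-SNE visualization code carry over with only the data loading changed.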