```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```
What does the data look like with cluster_std=0.5? With cluster_std=4.0?
Solution:
With cluster_std=0.5 the blobs are tight and trivially separable. With 4.0 they overlap heavily and even a perfect classifier can’t get 100% accuracy because the labels themselves disagree in the overlap region.
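To see this yourself, generate the two versions side by side. This is a sketch: the n_samples, centers, and random_state values here are assumptions, not necessarily the chapter's exact settings.

```python
from sklearn.datasets import make_blobs

# Same blob centers, two spreads -- only cluster_std differs.
X_tight, y = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)
X_loose, _ = make_blobs(n_samples=300, centers=4, cluster_std=4.0, random_state=42)
```

Scatter-plot each array (colored by `y`) to see the tight clusters versus the heavily overlapping ones.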
The output layer has one unit per class. We deliberately leave it without an activation — those raw outputs are called logits. nn.CrossEntropyLoss applies LogSoftmax internally and is numerically more stable than computing the softmax ourselves.
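A minimal illustration of the logits-plus-loss pattern, using a toy layer (the sizes here are arbitrary, not the chapter's model):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy output layer: 2 input features -> 3 classes, no final activation.
layer = nn.Linear(2, 3)
logits = layer(torch.randn(4, 2))  # raw scores ("logits"), one row per sample

# nn.CrossEntropyLoss takes logits plus integer class labels and applies
# LogSoftmax internally, which is more numerically stable.
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.tensor([0, 2, 1, 0]))

# For human-readable probabilities, apply softmax explicitly -- but only
# for inspection, never before the loss.
probs = torch.softmax(logits, dim=1)
```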
Loss tells the optimizer how to improve; accuracy tells us how well the model is doing in human terms. The two can move independently: accuracy only counts whether the top prediction is right, while cross-entropy also measures confidence, so it penalizes a confidently wrong answer far more than a barely wrong one.
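Accuracy is computed from the argmax of the logits, independently of the loss. A small worked example with hand-picked numbers:

```python
import torch

# Accuracy: the fraction of argmax predictions matching the labels.
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.2, 1.1]])
labels = torch.tensor([0, 1, 1])
preds = logits.argmax(dim=1)                  # predicted class per sample
accuracy = (preds == labels).float().mean()   # 2 of 3 correct -> ~0.667
```

Note that the third row is wrong by a whisker (1.2 vs. 1.1): accuracy counts it as a plain miss, while cross-entropy would penalize it only mildly.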
Comment out the nn.ReLU() lines, retrain from scratch, and plot the decision boundary again. What changes?
Solution:
Without non-linearities the network collapses to a single linear transformation, no matter how many layers it has. The decision boundaries become straight lines and accuracy drops on data that needs curved separators. Activations are what make a “deep” network actually deep.
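The collapse is easy to verify directly: two stacked Linear layers with nothing between them are exactly one linear map. A self-contained check (toy sizes, not the chapter's model):

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 2)

# Two Linear layers with no activation in between...
stacked = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 2))

# ...collapse to one linear map: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
w1, b1 = stacked[0].weight, stacked[0].bias
w2, b2 = stacked[1].weight, stacked[1].bias
collapsed = x @ (w2 @ w1).T + (w2 @ b1 + b2)
```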
Synthetic blobs and circles are good for understanding what the model does. Now we move to real tabular data: predict whether a patient has heart disease from a small set of clinical features.
The dataset is provided by the Cleveland Clinic Foundation: 303 rows, 13 features, one binary target.
| Feature | Type | Meaning |
| --- | --- | --- |
| age | numerical | Age in years |
| sex | categorical | 0 = female, 1 = male |
| cp | categorical | Chest-pain type (1–4) |
| trestbps | numerical | Resting blood pressure |
| chol | numerical | Serum cholesterol |
| fbs | categorical | Fasting blood sugar > 120 mg/dl |
| restecg | categorical | Resting ECG results |
| thalach | numerical | Maximum heart rate achieved |
| exang | categorical | Exercise-induced angina |
| oldpeak | numerical | ST depression induced by exercise |
| slope | numerical | Slope of the peak exercise ST segment |
| ca | categorical | Number of major vessels (0–3) |
| thal | categorical | normal, fixed, or reversible |
| target | binary | 1 = heart disease, 0 = no heart disease |
This is a much more realistic setup than 2-D toy data. You will learn:
- how to load tabular data with pandas,
- how to preprocess mixed numerical + categorical features,
- how to wrap tensors in a TensorDataset and iterate them with a DataLoader,
- how to run inference on a single new patient through the same pipeline.
```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```
The first thing to do with any new dataset is to look at it. df.head() shows the first rows; value_counts() checks for class imbalance. The Cleveland set is roughly balanced (165 vs. 138).
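The two inspection calls look like this. The frame below is a tiny stand-in with made-up values, just to show the pattern; the real dataset has 303 rows and 14 columns.

```python
import pandas as pd

# A tiny stand-in frame (made-up values) to demonstrate the two calls.
df = pd.DataFrame({"age": [63, 37, 41, 56], "target": [1, 1, 0, 1]})

df.head()                     # first rows: check dtypes, ranges, odd values
df["target"].value_counts()   # class counts: check for imbalance
```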
Mixed-type tabular data needs two preprocessing steps:
1. Numerical features get standardized (zero mean, unit variance) so the network's gradients don't get distorted by columns with very different scales.
2. Categorical features get integer-encoded so they're representable as numbers.
The cardinal rule: fit the preprocessors on the training set only, then transform both train and test. Otherwise statistics from the test set leak into training and your reported accuracy is optimistic.
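The rule in code, on a single synthetic column (the data here is made up; only the fit/transform split matters):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(120.0, 15.0, size=(100, 1))   # e.g. a blood-pressure column

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set ONLY
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Calling `fit_transform` on the test set (or on the full data before the split) is the leakage the text warns about.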
A batch is a subset of samples used in a single training iteration. Batched training is faster and gives smoother gradients. shuffle=True on the training loader prevents the model from memorising sample order.
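Wrapping tensors for batched iteration takes two lines. The shapes below mirror the heart-disease setup (about 240 training rows, 13 features), but the data is random:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(240, 13)
labels = torch.randint(0, 2, (240,)).float()

# TensorDataset pairs each feature row with its label;
# DataLoader serves them in shuffled batches of 32.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

n_batches = len(loader)  # ceil(240 / 32) = 8; the last batch holds 16 samples
```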
A small two-layer multilayer perceptron (MLP) with dropout. The output is a single logit per sample — BCEWithLogitsLoss will turn it into a probability internally.
| Layer | Purpose |
| --- | --- |
| nn.Linear(13, 32) | Linear transform from 13 features to 32 hidden units |
| nn.ReLU | Non-linearity |
| nn.Dropout(0.5) | Regularization: drops 50% of activations during training |
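A sketch of the model as nn.Sequential, combining the layers from the table with the single-logit output the text describes (the nn.Linear(32, 1) output layer follows from the description; treat the exact arrangement as an assumption):

```python
import torch
from torch import nn

# The layers from the table plus the output layer producing one raw logit.
model = nn.Sequential(
    nn.Linear(13, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1),   # one logit per sample -- no sigmoid here
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

out = model(torch.randn(4, 13))   # shape (4, 1): one logit per sample
```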
You should reach roughly 80–85% test accuracy. A few percent of variance between runs is normal — there are only ~60 test patients.
In each iteration:
1. optimizer.zero_grad() clears gradients accumulated on parameters.
2. model(inputs) runs a forward pass.
3. loss.backward() computes gradients via backpropagation.
4. optimizer.step() updates the parameters.
model.train() enables training-mode behavior (Dropout active); model.eval() switches it off so evaluation is deterministic.
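The four steps, plus the train/eval switch, in a runnable skeleton (toy random data and an assumed model shape, not the chapter's actual variables):

```python
import torch
from torch import nn

torch.manual_seed(42)
model = nn.Sequential(nn.Linear(13, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

inputs = torch.randn(64, 13)                  # toy stand-in data
targets = torch.randint(0, 2, (64, 1)).float()

model.train()                                 # Dropout etc. active
for _ in range(5):
    optimizer.zero_grad()                     # 1. clear old gradients
    loss = loss_fn(model(inputs), targets)    # 2. forward pass + loss
    loss.backward()                           # 3. backpropagation
    optimizer.step()                          # 4. parameter update

model.eval()                                  # deterministic evaluation mode
with torch.inference_mode():
    test_logits = model(inputs)
```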
Note
With only ~240 training examples, dropout makes a real difference. Try setting nn.Dropout(0.0) and re-run — train accuracy will reach ~100% while test accuracy stays put. That’s textbook overfitting.
The scaler and encoders objects must be saved alongside the model. Predicting later without them is a bug. Use pickle or joblib.
LabelEncoder.transform raises if it sees a value it didn’t see during training. Real systems handle this — for example by mapping unseen categories to a special “unknown” index.
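One simple workaround is a plain dict with a fallback index. This is a sketch of the idea, not part of the sklearn API:

```python
# Reserve one extra index for values never seen during training.
categories = ["normal", "fixed", "reversible"]
mapping = {cat: i for i, cat in enumerate(categories)}
UNKNOWN = len(categories)            # index 3 = "unknown" bucket

def encode(value):
    # Known categories map to their trained index; anything else to UNKNOWN.
    return mapping.get(value, UNKNOWN)
```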
The output is a probability, not a diagnosis. Threshold at 0.5 for a default decision; choose a different threshold to trade off false positives vs. false negatives depending on cost.
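The full path from raw logit to decision, with a made-up logit value:

```python
import torch

# logit -> sigmoid -> probability -> threshold -> decision
logit = torch.tensor(0.8)
prob = torch.sigmoid(logit)               # ~0.69

default_positive = (prob > 0.5).item()    # True at the default threshold
strict_positive = (prob > 0.9).item()     # False with a stricter cutoff
```

The same probability yields different decisions under different thresholds; which one is right depends on the relative cost of false positives and false negatives.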
Move the scaler.fit_transform call to be applied to the whole dataframe before train_test_split. Re-run training. What happens to the test accuracy, and why is the result misleading?
Solution:
Test accuracy goes up slightly because the scaler now “knows” the distribution of the test set. In a real deployment you don’t have the test set yet — only training data. The leaked statistics make the offline number look better than what you’d actually see in production. Always fit preprocessors on training data only.
Reduce the training set to n_samples=50 (with cluster_std=2.0). Train and evaluate. Now repeat with n_samples=2000. What changes about the decision boundary?
Train a model with no hidden layers — nn.Linear(2, 4) directly. What’s the maximum accuracy you can reach? On what kinds of datasets is it enough?
Build a confusion matrix on the heart-disease test set: how many false positives and false negatives does the model produce?
Try thresholds other than 0.5. Plot precision and recall as functions of the threshold (probs > t for t between 0.1 and 0.9). Which threshold would you pick if a false negative is twice as costly as a false positive?
Add a second hidden layer to HeartModel. Does it help? Why might it not?
Pickle the trained model and the scaler + encoders to a single file with joblib.dump({"model": model.state_dict(), "scaler": scaler, "encoders": encoders}, "heart.pkl"). Load them in a fresh script and predict on the same sample.
```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

x_np, y_np = make_moons(n_samples=300, noise=0.30, random_state=0)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=10)
plt.title("moons (noisy)")
plt.show()
```
The data is deliberately noisy — some red points are inside the blue moon and vice versa. A perfect classifier on this data does not exist; the best we can do is recover the underlying shape.
Train accuracy will reach 100%; test accuracy will be lower than the tiny model's because the big network has started memorizing the training noise. This is overfitting.
Use the plot_decision_boundary function from earlier. The big model will draw bizarre wiggles around individual training points; the tiny model will draw an almost-straight line.