```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```
What does the data look like with cluster_std=0.5? With cluster_std=4.0?
Solution:
With cluster_std=0.5 the blobs are tight and trivially separable. With 4.0 they overlap heavily and even a perfect classifier can’t get 100% accuracy because the labels themselves disagree in the overlap region.
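To see this yourself, generate the two versions side by side. This is a sketch: the n_samples, centers, and random_state values here are assumptions, not necessarily the chapter's exact settings.

```python
from sklearn.datasets import make_blobs

# Same blob centers, two spreads -- only cluster_std differs.
X_tight, y = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)
X_loose, _ = make_blobs(n_samples=300, centers=4, cluster_std=4.0, random_state=42)
```

Scatter-plot each array (colored by `y`) to see the tight clusters versus the heavily overlapping ones.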
The output layer has one unit per class. We deliberately leave it without an activation — those raw outputs are called logits. nn.CrossEntropyLoss applies LogSoftmax internally and is numerically more stable than computing the softmax ourselves.
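A minimal illustration of the logits-plus-loss pattern, using a toy layer (the sizes here are arbitrary, not the chapter's model):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy output layer: 2 input features -> 3 classes, no final activation.
layer = nn.Linear(2, 3)
logits = layer(torch.randn(4, 2))  # raw scores ("logits"), one row per sample

# nn.CrossEntropyLoss takes logits plus integer class labels and applies
# LogSoftmax internally, which is more numerically stable.
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, torch.tensor([0, 2, 1, 0]))

# For human-readable probabilities, apply softmax explicitly -- but only
# for inspection, never before the loss.
probs = torch.softmax(logits, dim=1)
```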
Loss tells the optimizer how to improve; accuracy tells us how well the model is doing in human terms. The two can move independently: accuracy only counts whether the top prediction is right, while cross-entropy also measures confidence, so it penalizes a confidently wrong answer far more than a barely wrong one.
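Accuracy is computed from the argmax of the logits, independently of the loss. A small worked example with hand-picked numbers:

```python
import torch

# Accuracy: the fraction of argmax predictions matching the labels.
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.2, 1.1]])
labels = torch.tensor([0, 1, 1])
preds = logits.argmax(dim=1)                  # predicted class per sample
accuracy = (preds == labels).float().mean()   # 2 of 3 correct -> ~0.667
```

Note that the third row is wrong by a whisker (1.2 vs. 1.1): accuracy counts it as a plain miss, while cross-entropy would penalize it only mildly.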
Comment out the nn.ReLU() lines, retrain from scratch, and plot the decision boundary again. What changes?
Solution:
Without non-linearities the network collapses to a single linear transformation, no matter how many layers it has. The decision boundaries become straight lines and accuracy drops on data that needs curved separators. Activations are what make a “deep” network actually deep.
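The collapse is easy to verify directly: two stacked Linear layers with nothing between them are exactly one linear map. A self-contained check (toy sizes, not the chapter's model):

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 2)

# Two Linear layers with no activation in between...
stacked = nn.Sequential(nn.Linear(2, 8), nn.Linear(8, 2))

# ...collapse to one linear map: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2).
w1, b1 = stacked[0].weight, stacked[0].bias
w2, b2 = stacked[1].weight, stacked[1].bias
collapsed = x @ (w2 @ w1).T + (w2 @ b1 + b2)
```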
Synthetic blobs and circles are good for understanding what the model does. Now we move to real tabular data: predict whether a patient has heart disease from a small set of clinical features.
The dataset is provided by the Cleveland Clinic Foundation: 303 rows, 13 features, one binary target.
| Feature | Type | Meaning |
| --- | --- | --- |
| age | numerical | Age in years |
| sex | categorical | 0 = female, 1 = male |
| cp | categorical | Chest-pain type (1–4) |
| trestbps | numerical | Resting blood pressure |
| chol | numerical | Serum cholesterol |
| fbs | categorical | Fasting blood sugar > 120 mg/dl |
| restecg | categorical | Resting ECG results |
| thalach | numerical | Maximum heart rate achieved |
| exang | categorical | Exercise-induced angina |
| oldpeak | numerical | ST depression induced by exercise |
| slope | numerical | Slope of the peak exercise ST segment |
| ca | categorical | Number of major vessels (0–3) |
| thal | categorical | normal, fixed, or reversible |
| target | binary | 1 = heart disease, 0 = no heart disease |
This is a much more realistic setup than 2-D toy data. You will learn:
- how to load tabular data with pandas,
- how to preprocess mixed numerical + categorical features,
- how to wrap tensors in a TensorDataset and iterate them with a DataLoader,
- how to run inference on a single new patient through the same pipeline.
```python
import pandas as pd
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)
```
The first thing to do with any new dataset is to look at it. df.head() shows the first rows; value_counts() checks for class imbalance. The Cleveland set is roughly balanced (165 vs. 138).
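The two inspection calls look like this. The frame below is a tiny stand-in with made-up values, just to show the pattern; the real dataset has 303 rows and 14 columns.

```python
import pandas as pd

# A tiny stand-in frame (made-up values) to demonstrate the two calls.
df = pd.DataFrame({"age": [63, 37, 41, 56], "target": [1, 1, 0, 1]})

df.head()                     # first rows: check dtypes, ranges, odd values
df["target"].value_counts()   # class counts: check for imbalance
```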
Mixed-type tabular data needs two preprocessing steps:
1. Numerical features get standardized (zero mean, unit variance) so the network's gradients don't get distorted by columns with very different scales.
2. Categorical features get integer-encoded so they're representable as numbers.
The cardinal rule: fit the preprocessors on the training set only, then transform both train and test. Otherwise statistics from the test set leak into training and your reported accuracy is optimistic.
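The rule in code, on a single synthetic column (the data here is made up; only the fit/transform split matters):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(120.0, 15.0, size=(100, 1))   # e.g. a blood-pressure column

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set ONLY
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Calling `fit_transform` on the test set (or on the full data before the split) is the leakage the text warns about.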
A batch is a subset of samples used in a single training iteration. Batched training is faster and gives smoother gradients. shuffle=True on the training loader prevents the model from memorising sample order.
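Wrapping tensors for batched iteration takes two lines. The shapes below mirror the heart-disease setup (about 240 training rows, 13 features), but the data is random:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.randn(240, 13)
labels = torch.randint(0, 2, (240,)).float()

# TensorDataset pairs each feature row with its label;
# DataLoader serves them in shuffled batches of 32.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

n_batches = len(loader)  # ceil(240 / 32) = 8; the last batch holds 16 samples
```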
A small two-layer multilayer perceptron (MLP) with dropout. The output is a single logit per sample — BCEWithLogitsLoss will turn it into a probability internally.
| Layer | Purpose |
| --- | --- |
| nn.Linear(13, 32) | Linear transform from 13 features to 32 hidden units |
| nn.ReLU | Non-linearity |
| nn.Dropout(0.5) | Regularization: drops 50% of activations during training |
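A sketch of the model as nn.Sequential, combining the layers from the table with the single-logit output the text describes (the nn.Linear(32, 1) output layer follows from the description; treat the exact arrangement as an assumption):

```python
import torch
from torch import nn

# The layers from the table plus the output layer producing one raw logit.
model = nn.Sequential(
    nn.Linear(13, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1),   # one logit per sample -- no sigmoid here
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

out = model(torch.randn(4, 13))   # shape (4, 1): one logit per sample
```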
You should reach roughly 80–85% test accuracy. A few percent of variance between runs is normal — there are only ~60 test patients.
In each iteration:
1. optimizer.zero_grad() clears gradients accumulated on parameters.
2. model(inputs) runs a forward pass.
3. loss.backward() computes gradients via backpropagation.
4. optimizer.step() updates the parameters.
model.train() enables training-mode behavior (Dropout active); model.eval() switches it off so evaluation is deterministic.
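The four steps, plus the train/eval switch, in a runnable skeleton (toy random data and an assumed model shape, not the chapter's actual variables):

```python
import torch
from torch import nn

torch.manual_seed(42)
model = nn.Sequential(nn.Linear(13, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

inputs = torch.randn(64, 13)                  # toy stand-in data
targets = torch.randint(0, 2, (64, 1)).float()

model.train()                                 # Dropout etc. active
for _ in range(5):
    optimizer.zero_grad()                     # 1. clear old gradients
    loss = loss_fn(model(inputs), targets)    # 2. forward pass + loss
    loss.backward()                           # 3. backpropagation
    optimizer.step()                          # 4. parameter update

model.eval()                                  # deterministic evaluation mode
with torch.inference_mode():
    test_logits = model(inputs)
```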
Note
With only ~240 training examples, dropout makes a real difference. Try setting nn.Dropout(0.0) and re-run — train accuracy will reach ~100% while test accuracy stays put. That’s textbook overfitting.
The scaler and encoders objects must be saved alongside the model. Predicting later without them is a bug. Use pickle or joblib.
LabelEncoder.transform raises if it sees a value it didn’t see during training. Real systems handle this — for example by mapping unseen categories to a special “unknown” index.
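One simple workaround is a plain dict with a fallback index. This is a sketch of the idea, not part of the sklearn API:

```python
# Reserve one extra index for values never seen during training.
categories = ["normal", "fixed", "reversible"]
mapping = {cat: i for i, cat in enumerate(categories)}
UNKNOWN = len(categories)            # index 3 = "unknown" bucket

def encode(value):
    # Known categories map to their trained index; anything else to UNKNOWN.
    return mapping.get(value, UNKNOWN)
```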
The output is a probability, not a diagnosis. Threshold at 0.5 for a default decision; choose a different threshold to trade off false positives vs. false negatives depending on cost.
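The full path from raw logit to decision, with a made-up logit value:

```python
import torch

# logit -> sigmoid -> probability -> threshold -> decision
logit = torch.tensor(0.8)
prob = torch.sigmoid(logit)               # ~0.69

default_positive = (prob > 0.5).item()    # True at the default threshold
strict_positive = (prob > 0.9).item()     # False with a stricter cutoff
```

The same probability yields different decisions under different thresholds; which one is right depends on the relative cost of false positives and false negatives.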
Move the scaler.fit_transform call to be applied to the whole dataframe before train_test_split. Re-run training. What happens to the test accuracy, and why is the result misleading?
Solution:
Test accuracy goes up slightly because the scaler now “knows” the distribution of the test set. In a real deployment you don’t have the test set yet — only training data. The leaked statistics make the offline number look better than what you’d actually see in production. Always fit preprocessors on training data only.
Reduce the training set to n_samples=50 (with cluster_std=2.0). Train and evaluate. Now repeat with n_samples=2000. What changes about the decision boundary?
Train a model with no hidden layers — nn.Linear(2, 4) directly. What’s the maximum accuracy you can reach? On what kinds of datasets is it enough?
Build a confusion matrix on the heart-disease test set: how many false positives and false negatives does the model produce?
Try thresholds other than 0.5. Plot precision and recall as functions of the threshold (probs > t for t between 0.1 and 0.9). Which threshold would you pick if a false negative is twice as costly as a false positive?
Add a second hidden layer to HeartModel. Does it help? Why might it not?
Pickle the trained model and the scaler + encoders to a single file with joblib.dump({"model": model.state_dict(), "scaler": scaler, "encoders": encoders}, "heart.pkl"). Load them in a fresh script and predict on the same sample.
```python
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

x_np, y_np = make_moons(n_samples=300, noise=0.30, random_state=0)
x = torch.from_numpy(x_np).float()
y = torch.from_numpy(y_np).long()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
x_train, y_train = x_train.to(device), y_train.to(device)
x_test, y_test = x_test.to(device), y_test.to(device)

plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.RdYlBu, s=10)
plt.title("moons (noisy)")
plt.show()
```
The data is deliberately noisy — some red points are inside the blue moon and vice versa. A perfect classifier on this data does not exist; the best we can do is recover the underlying shape.
Train accuracy will reach 100%; test accuracy will be lower than the tiny model's because the big network has started memorizing the training noise. This is overfitting.
Use the plot_decision_boundary function from earlier. The big model will draw bizarre wiggles around individual training points; the tiny model will draw an almost-straight line.