Homework 4: Generalization and regularization¶
0 - Introduction to PyTorch and Autograd¶
In Week 2, we introduced the concept of a computational graph, which is used to efficiently compute gradients in deep neural networks. The key idea is to track intermediate values during forward computation, as these values are reused during backpropagation. By storing these intermediate values, we can significantly reduce the computational cost of gradient computation.
This concept of using a computational graph to compute gradients is implemented in PyTorch, originally developed by Facebook’s AI Research (FAIR) group (now part of Meta), through a process called automatic differentiation, also known as autograd.
Similar to numpy, in PyTorch we define variables as tensors. PyTorch tensors have an additional feature: if requires_grad=True, they automatically track operations for gradient computation.
import torch
torch.manual_seed(0) # seed for reproducibility
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + 2*y ** 2
print(f"x.requires_grad: {x.requires_grad}")
print(f"y.requires_grad: {y.requires_grad}")
print(f"z.requires_grad: {z.requires_grad}")
x.requires_grad: True y.requires_grad: True z.requires_grad: True
In this example:
- x and y are tensors with requires_grad=True, meaning they are tracked for gradient computation.
- The result, z, also has requires_grad=True, since it is the result of an operation involving tensors that require gradients.
We define two tensor variables, x and y, and set requires_grad=True to indicate that PyTorch should track their gradients. This means that PyTorch will construct a computational graph involving x and y whenever operations are performed on them.
In this example, we use x and y to compute z. To compute the gradients of z with respect to x and y, we simply call z.backward(). This applies the chain rule, propagating the gradients backward through the computational graph. The gradients of z with respect to x and y can then be accessed via x.grad and y.grad.
z.backward()
print(f"Gradient of x: {x.grad}")
print(f"Gradient of y: {y.grad}")
Gradient of x: 2.0 Gradient of y: 8.0
We can also visualize the computational graph by inspecting z.grad_fn, which shows the function that created z. This is useful for understanding how PyTorch constructs the graph and tracks operations. Each operation is represented as a node in the graph, and by following grad_fn we can see the chain of operations that led to the final result, allowing us to trace the graph backward for gradient computation.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + 2*y ** 2
a = torch.sin(z)
print(f"a.requires_grad: {a.requires_grad}")
print(a.grad_fn)
print(a.grad_fn.next_functions)
print(z.grad_fn)
print(x.grad_fn)
a.requires_grad: True <SinBackward0 object at 0x7fcd1f42fdc0> ((<AddBackward0 object at 0x7fcd1f42e5f0>, 0),) <AddBackward0 object at 0x7fcd1f42fdc0> None
In this example:
- print(a.grad_fn): shows the function that created a, which is SinBackward0, indicating that a was generated by the sin operation.
- print(a.grad_fn.next_functions): displays the previous functions in the graph that led to a, i.e., the operations applied to produce z before sin.
- print(z.grad_fn): prints the function that created z (AddBackward0), which matches the entry in a.grad_fn.next_functions.
- print(x.grad_fn): shows None because x is a leaf tensor, meaning it was not created by any operation but was initialized directly. Leaf tensors do not have a grad_fn.
To stop gradient tracking, we can use .detach() to remove a tensor from the computational graph. In this example, we detach the tensor x, creating x_detached. The detached tensor has its requires_grad flag set to False, so when we call z.backward() it does not receive a gradient.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + 2*y ** 2
x_detached = x.detach()
z.backward()
print(f"x.requires_grad: {x.requires_grad}")
print(f"x_detached.requires_grad: {x_detached.requires_grad}")
print(f"x.grad: {x.grad}")
print(f"x_detached.grad: {x_detached.grad}")
x.requires_grad: True x_detached.requires_grad: False x.grad: 2.0 x_detached.grad: None
However, it's important to note that x_detached is not a deep copy of x. Both tensors share the same memory, but x_detached is excluded from the computational graph used for automatic differentiation. This means that if you modify x_detached in place, the value of x may also change.
To avoid this, if you want to modify x_detached without affecting the original x, you can use x.clone() to create an actual copy of the tensor.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + 2*y ** 2
x_clone = x.clone()
x_clone = torch.tensor(5.0)
print(f"x_clone: {x_clone}")
print(f"x: {x}")
print(f"x_clone.requires_grad: {x_clone.requires_grad}")
x_clone: 5.0 x: 1.0 x_clone.requires_grad: False
In PyTorch, a tensor produced by clone() retains the requires_grad setting of the original, so x.clone() has requires_grad=True (and gradients flow back through the clone to x). In the example above, however, the name x_clone is immediately rebound to a brand-new tensor, torch.tensor(5.0), which does not require gradients; this is why the printed x_clone.requires_grad is False, and why x itself is left unchanged.
If you want a cloned copy that does not track gradients at all, you can detach it after cloning:
x_clone = x.clone().detach()
print(f"x_clone.requires_grad: {x_clone.requires_grad}")
x_clone.requires_grad: False
Another way to temporarily prevent gradient tracking, when you only need the current value of x for some computation, is torch.no_grad(). This context manager disables gradient tracking for all operations inside its block, making it useful when you want to perform computations without affecting the computational graph.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + 2 * y ** 2
with torch.no_grad():
a = torch.sin(x)
print(f"a.requires_grad: {a.requires_grad}")
a.requires_grad: False
Because a was created inside the no_grad() block, it does not track gradients, as confirmed by a.requires_grad being False. This is useful when you want to perform computations that don't need gradients, such as during inference or when manually updating parameters outside of autograd.
1 - Build a Neural Network and Train using autograd¶
We have briefly introduced how autograd uses computational graphs to efficiently compute gradients. Now, let's use autograd to train a deep neural network (DNN).
1.1 - Define ShallowNet [10/10]¶
The first step is to define a two-layer neural network class called ShallowNet using torch.nn.
Exercise 1 [10/10]:
- The ShallowNet takes input size n_x, output size n_y, and width n_h.
- Instead of manually specifying weights and biases, use nn.Linear() to define the linear transformations that compute the preactivations; the weights and biases are handled internally.
- Define the activation function self.act as a class attribute using nn.ReLU().
- Implement the forward() method to perform forward propagation and return the output of the network.
Note that, thanks to autograd, you don't need to manually implement backpropagation as you did before.
import torch.nn as nn
class ShallowNet(nn.Module):
def __init__(self, n_x, n_h, n_y):
super(ShallowNet, self).__init__()
self.fc1 = nn.Linear(n_x, n_h)
self.fc2 = nn.Linear(n_h, n_y)
self.act = nn.ReLU()
def forward(self, x):
# Forward pass: x -> linear() -> act() -> linear()
### Code Here ###
### Code Here ###
return x
# Example of how to initialize the network
n_x = 4 # Input size
n_h = 10 # Hidden layer width
n_y = 1 # Output size
model = ShallowNet(n_x, n_h, n_y)
print(model)
ShallowNet( (fc1): Linear(in_features=4, out_features=10, bias=True) (fc2): Linear(in_features=10, out_features=1, bias=True) (act): ReLU() )
num_samples = 8
x = torch.randn(num_samples, n_x)
output = model(x)
print(output)
tensor([[ 0.3817], [ 0.3093], [ 0.2687], [ 0.1068], [-0.0068], [ 0.0091], [ 0.1560], [ 0.1721]], grad_fn=<AddmmBackward0>)
Explanation:
- nn.Linear(n_x, n_h): defines a fully connected layer from input size n_x to hidden layer size n_h.
- self.act = nn.ReLU(): defines the ReLU activation function.
- forward(): implements the forward propagation, applying the activation function after the first layer and passing the result through the second layer to produce the output.
You can now rely on autograd to automatically handle backpropagation during training.
1.2 - Define Loss Function and Optimizer [10/10]¶
To train the ShallowNet, we need to specify both a loss function and an optimizer. For example, we can use nn.MSELoss() as the loss function and torch.optim.SGD() as the optimizer. The optimizer requires model.parameters() and a learning rate, which we define as learning_rate.
# Define the loss function
criterion = nn.MSELoss()
# Define the optimizer (SGD)
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
Next, let's train this simple neural network using full-batch gradient descent, where we use all the training data in each step, instead of mini-batches.
Exercise 2 [10/10]:
- In each training loop, evaluate the model on the entire training set x to obtain the output.
- Compute the loss by using the predefined loss function criterion to compare the output with the ground-truth labels y.
- Use autograd to compute the gradients by calling loss.backward(), instead of calling model.backward() as in the last assignment.
- The computed gradients are automatically stored in the model's parameters, which were passed to the optimizer. Apply a gradient descent update by calling optimizer.step(), as done in the previous assignment.
# Generate random data (full-batch)
torch.manual_seed(0) # For reproducibility
x = torch.randn(num_samples, n_x)
y = torch.randn(num_samples, 1)
num_epochs = 10
for epoch in range(num_epochs):
optimizer.zero_grad() # Clear previous gradients
# Forward pass
### Code Here ###
### Code Here ###
# Backward pass and optimization
### Code Here ###
### Code Here ###
# Print the loss
print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [1/10], Loss: 0.9464 Epoch [2/10], Loss: 0.9357 Epoch [3/10], Loss: 0.9255 Epoch [4/10], Loss: 0.9158 Epoch [5/10], Loss: 0.9066 Epoch [6/10], Loss: 0.8978 Epoch [7/10], Loss: 0.8893 Epoch [8/10], Loss: 0.8812 Epoch [9/10], Loss: 0.8734 Epoch [10/10], Loss: 0.8660
Note:
- In PyTorch, when using torch.optim.SGD(), you get full-batch gradient descent (GD) if you pass the entire dataset as a single batch, as in the example above.
- PyTorch's DataLoader defaults to batch_size=1 if you do not specify a batch size; to perform full-batch gradient descent with a DataLoader, set batch_size to the size of the dataset (or simply pass the full tensors directly, as above).
- To use mini-batch stochastic gradient descent (SGD), load your data in mini-batches using a DataLoader with a suitable batch_size, as illustrated in the sketch below.
1.3 - Use DataLoader for Mini-Batch SGD [20/20]¶
In this exercise, we will modify the training loop to use mini-batch stochastic gradient descent (SGD) by loading the data in mini-batches using PyTorch's DataLoader.
Exercise 3 [10/10]:
- Generate random data x and y as before.
- Use TensorDataset(x, y) to wrap the input tensor x and label tensor y into a dataset. Ensure that the first dimension (number of samples) matches for both tensors.
- PyTorch's DataLoader can automatically divide a dataset into mini-batches and reshuffle the data at every epoch when shuffle=True is set.
- Set batch_size=32 and shuffle=True when creating the data_loader.
import torch
from torch.utils.data import DataLoader, TensorDataset
# Generate random data (as tensors)
num_samples = 1000
n_x = 10 # Number of features
n_y = 1 # Number of output features
x = torch.randn(num_samples, n_x) # Input data
y = torch.randn(num_samples, 1) # Target output
# Create a TensorDataset and DataLoader for mini-batch processing
dataset = TensorDataset(x, y)
batch_size = 32
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
Exercise 4 [10/10]: Now that we have the mini-batches prepared, we can use SGD with mini-batches to train the ShallowNet.
- Define a ShallowNet with hidden layer width n_h=10.
- Specify the loss criterion using MSELoss().
- Define the SGD optimizer using torch.optim.SGD.
- For each mini-batch, data_loader returns an (inputs, targets) pair:
  - Evaluate our model on the inputs to get the outputs.
  - Compute the loss using the outputs and targets.
  - Compute the gradients by calling loss.backward().
  - Apply the gradient update using optimizer.step().
torch.random.manual_seed(0) # For reproducibility
# Define the model, loss function, and optimizer
n_h = 10 # Hidden layer width
model = ShallowNet(n_x=n_x, n_h=n_h, n_y=n_y)
criterion = torch.nn.MSELoss()
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Training loop with mini-batch SGD
num_epochs = 10
for epoch in range(num_epochs):
running_loss = 0.0
# Loop over mini-batches
### Code Here ###
for
### Code Here ###
optimizer.zero_grad() # Clear previous gradients
# Forward pass: compute the model's output for the mini-batch
### Code Here ###
### Code Here ###
# Backward pass and optimization
### Code Here ###
### Code Here ###
# Accumulate loss for printing
running_loss += loss.item()
# Print the average loss for the current epoch
avg_loss = running_loss / len(data_loader)
print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {avg_loss:.4f}")
Epoch [1/10], Loss: 1.1037 Epoch [2/10], Loss: 1.0542 Epoch [3/10], Loss: 1.0580 Epoch [4/10], Loss: 1.0350 Epoch [5/10], Loss: 1.0737 Epoch [6/10], Loss: 1.0209 Epoch [7/10], Loss: 1.0275 Epoch [8/10], Loss: 1.0241 Epoch [9/10], Loss: 1.0418 Epoch [10/10], Loss: 1.0044
Note:
- running_loss keeps track of the total loss across all mini-batches within each epoch.
- avg_loss computes the average loss over the entire epoch.
- optimizer.zero_grad() clears the previously computed gradients, since in PyTorch gradients are accumulated by default.
- len(data_loader) gives the number of mini-batches in the dataset: num_samples divided by batch_size, rounded up. In our case, $\lceil 1000/32\rceil = 32$.
- The total number of training iterations (loops) is num_epochs multiplied by the number of mini-batches per epoch; in our example, $10 \times 32 = 320$ training steps. A quick verification is sketched below.
1.4 - Train on a realistic dataset using PyTorch [40/40]¶
Next, let us load the MNIST dataset. This time we will use torch.
import torch
import torchvision
train_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor())
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=torchvision.transforms.ToTensor())
print(f"Number of training samples: {len(train_dataset)}")
print(f"Number of test samples: {len(test_dataset)}")
print(f"Input shape: {train_dataset[0][0].shape}")
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz Failed to download (trying next): <urlopen error [Errno 111] Connection refused> Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to ./data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9.91M/9.91M [00:02<00:00, 4.12MB/s]
Extracting ./data/MNIST/raw/train-images-idx3-ubyte.gz to ./data/MNIST/raw Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz Failed to download (trying next): <urlopen error [Errno 111] Connection refused> Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to ./data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28.9k/28.9k [00:00<00:00, 132kB/s]
Extracting ./data/MNIST/raw/train-labels-idx1-ubyte.gz to ./data/MNIST/raw Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz Failed to download (trying next): <urlopen error [Errno 111] Connection refused> Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1.65M/1.65M [00:01<00:00, 1.26MB/s]
Extracting ./data/MNIST/raw/t10k-images-idx3-ubyte.gz to ./data/MNIST/raw Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz Failed to download (trying next): <urlopen error [Errno 111] Connection refused> Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4.54k/4.54k [00:00<00:00, 4.22MB/s]
Extracting ./data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/MNIST/raw Number of training samples: 60000 Number of test samples: 10000 Input shape: torch.Size([1, 28, 28])
It is generally recommended to get a sense of the dataset before starting on the machine learning task.
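For a quick numerical peek (a small sketch; the printed label depends on the dataset ordering), we can inspect a single sample and confirm that ToTensor() scales pixel values into [0, 1]:
image, label = train_dataset[0]
print(image.shape)                              # torch.Size([1, 28, 28])
print(image.min().item(), image.max().item())   # pixel values lie in [0, 1]
print(label)                                    # the integer class label of this sample
Since MNIST is an image dataset, it is also helpful to plot a few images and take a look using the following function.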
import matplotlib.pyplot as plt
def plot_images(dataset, num_images=5):
random_indices = torch.randperm(len(dataset))[:num_images] # Randomly shuffle indices and pick num_images
fig, axes = plt.subplots(1, num_images, figsize=(12, 4))
for i, idx in enumerate(random_indices):
# Get image and label
image, label = dataset[idx]
if image.ndim == 3: # Check if the image has 3 channels (RGB)
image = image.permute(1, 2, 0) # Rearrange dimensions to (Height, Width, Channels)
else:
image = image.squeeze() # Remove the channel dimension for plotting (1x28x28 -> 28x28)
# Plot the image
axes[i].imshow(image, cmap='gray')
axes[i].set_title(f"Label: {label}")
axes[i].axis('off') # Turn off the axis
plt.show()
# Plot the random 5 images from `train_data_full`
plot_images(train_dataset, num_images=5)
Exercise 5 [10/10]: To apply SGD with mini-batches, we need a DataLoader to load the data as mini-batches. Additionally, to monitor performance not only on the training set but also on the test dataset (or a validation dataset for tuning), we need both a train_loader and a test_loader.
- Define batch_size with value 64.
- Define train_loader using DataLoader, ensuring the shuffle=True flag is set so that the training data is reshuffled at every epoch.
- Define test_loader using DataLoader for the test dataset; since it's not necessary to shuffle the test data, you can leave the shuffle flag at its default (False).
# Create DataLoaders
### Code Here ###
### Code Here ###
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
Now that we have the training and test datasets ready in mini-batch form, we can define our model, loss criterion, and optimizer to start training.
Exercise 6 [10/10]:
- Use ShallowNet to define a two-layer network model with 10 hidden units and n_y=10, since MNIST contains 10 digits.
- Specify the loss criterion to be nn.MSELoss().
- Define the SGD optimizer using torch.optim.SGD(), passing the model's parameters and the learning rate.
- For each epoch, loop over each mini-batch (inputs, targets) loaded from train_loader:
  - Since each image in inputs has shape 28x28 but the MLP expects a vector, flatten the images using .view(-1, 28*28).
  - Compute the outputs by passing the inputs through the model.
  - The model produces 10 output values per sample, but the targets are integer class labels. Convert the targets to one-hot encoding using torch.nn.functional.one_hot(targets, num_classes=10) (see the sketch after this list).
  - Compute the train_loss using criterion on the outputs and targets.
  - Accumulate the train_loss into train_running_loss.
  - Perform the gradient computation and take the optimization step to update the model's parameters.
- Apply the same steps to test_loader to compute test_loss and update test_running_loss, except skip the gradient computation and optimization steps.
- At the end of each epoch, store the averaged train_running_loss and test_running_loss into train_losses and test_losses, respectively.
from torch.nn.functional import one_hot
torch.manual_seed(0) # For reproducibility
n_y = 10 # 10 classes
model = ShallowNet(n_x=28*28, n_h=10, n_y=n_y)
criterion = nn.MSELoss()
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_losses = []
test_losses = []
num_epochs = 10
for epoch in range(num_epochs):
train_running_loss = 0.0
test_running_loss = 0.0
# Test loop
model.eval() # Set the model to evaluation mode
with torch.no_grad():
for inputs, targets in test_loader:
### Code Here ###
### Code Here ###
outputs = model(inputs)
test_loss = criterion(outputs, targets)
test_running_loss += test_loss.item()
# Training loop
model.train() # Set the model to training mode
for i, (inputs, targets) in enumerate(train_loader):
optimizer.zero_grad() # Clear previous gradients
### Code Here ###
### Code Here ###
outputs = model(inputs)
train_loss = criterion(outputs, targets)
train_running_loss += train_loss.item()
train_loss.backward()
optimizer.step()
train_losses.append(train_running_loss / len(train_loader))
test_losses.append(test_running_loss / len(test_loader))
print(f"Epoch [{epoch + 1}/{num_epochs}], Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}")
Epoch [1/10], Train Loss: 0.0874, Test Loss: 0.1305 Epoch [2/10], Train Loss: 0.0730, Test Loss: 0.0769 Epoch [3/10], Train Loss: 0.0661, Test Loss: 0.0688 Epoch [4/10], Train Loss: 0.0612, Test Loss: 0.0629 Epoch [5/10], Train Loss: 0.0575, Test Loss: 0.0587 Epoch [6/10], Train Loss: 0.0546, Test Loss: 0.0554 Epoch [7/10], Train Loss: 0.0525, Test Loss: 0.0529 Epoch [8/10], Train Loss: 0.0508, Test Loss: 0.0510 Epoch [9/10], Train Loss: 0.0495, Test Loss: 0.0495 Epoch [10/10], Train Loss: 0.0485, Test Loss: 0.0484
Now that we have obtained train_losses and test_losses, it is essential to check the convergence behavior to determine whether it is smooth, without significant oscillation. If oscillations occur, we may need to consider shrinking the learning rate. In this exercise, you will implement a function plot_losses() to visualize both losses on the same plot.
Exercise 7 [10/10]:
- Import matplotlib.pyplot as plt for plotting.
- The function plot_losses takes train_losses and test_losses as input.
- Use plt.figure() to define a figure with figsize=(10, 6).
- Plot train_losses using plt.plot(). To distinguish it from test_losses, specify parameters such as label, color, and linestyle.
- Plot test_losses in the same way as train_losses.
- Add axis labels and a title.
- Add a legend.
- Finally, call plt.show() to display the figure.
import matplotlib.pyplot as plt
def plot_losses(train_losses, test_losses):
plt.figure(figsize=(10, 6))
# Plot training losses (blue dashed line)
plt.plot(train_losses, label='Training Loss', color='blue', linestyle='--')
# Plot test losses (red solid line)
### Code Here ###
### Code Here ###
# Add labels and title
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Test Loss Over Epochs')
# Add a legend
plt.legend()
# Show the plot
plt.show()
# Plot the train and test losses
plot_losses(train_losses, test_losses)
Once the model is trained, we can use it to make predictions and calculate the accuracy.
Exercise 8 [10/10]:
- The function predict_and_accuracy takes a model and a data_loader.
- Loop over the data_loader, similar to how you handled the test loop:
  - For each mini-batch, compute the outputs by applying the model to the inputs.
  - Use torch.max(outputs.data, 1) to retrieve the predicted class labels by finding the index of the maximum value in outputs (see the sketch after this list).
  - Collect the predicted labels into the running collection predictions using extend(), which appends one list to another in Python.
  - Update the total number of predictions made (total) and the total number of correct predictions (correct).
- After completing the loop, compute the accuracy as the ratio of correct predictions to the total number of predictions (correct / total).
- Return predictions and accuracy.
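As a reminder of the API (an illustrative sketch with a toy tensor), torch.max along dimension 1 returns both the maximum values and their indices; the indices serve as the predicted class labels:
import torch

scores = torch.tensor([[0.1, 2.0, -1.0],
                       [3.0, 0.5,  0.2]])
values, predicted = torch.max(scores, 1)
print(values)      # tensor([2., 3.])
print(predicted)   # tensor([1, 0])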
def predict_and_accuracy(model, data_loader):
model.eval() # Set the model to evaluation mode
correct = 0
total = 0
predictions = []
with torch.no_grad(): # Disable gradient computation during prediction
for inputs, targets in data_loader:
inputs = inputs.view(-1, 28 * 28) # Flatten the images
outputs = model(inputs)
### Code Here ###
### Code Here ###
predictions.extend(predicted.tolist())
total += targets.size(0)
correct += (predicted == targets).sum().item()
### Code Here ###
### Code Here ###
return predictions, accuracy
# Train Accuracy:
predictions, accuracy = predict_and_accuracy(model, train_loader)
print(f"Training Accuracy:: {accuracy:.4f}")
Training Accuracy:: 0.8137
# Test Accuracy
predictions, accuracy = predict_and_accuracy(model, test_loader)
print(f"Testing Accuracy: {accuracy:.4f}")
Testing Accuracy: 0.8259
2 - Tuning Hyperparameters¶
As introduced in the lectures, training deep neural networks (DNNs) involves a variety of choices for hyperparameters, such as:
- choice of loss functions,
- activation functions,
- learning rate,
- optimizers,
- network architectures,
- etc.
In this section, we will review the tuning strategies introduced in the lectures.
2.1 - Train with Cross Entropy Loss [10/10]¶
Let’s first explore the effect of changing the loss function from mean square error to cross-entropy loss. Specifically, we want to check if switching the loss function results in significant differences in training performance.
Note: When we define ShallowNet, the model outputs raw real values (logits). However, cross-entropy loss expects its inputs to represent probabilities, where each value is in $[0,1]$ and the values sum to $1$. This is achieved by applying the softmax activation function: $$ b_i = \frac{\exp(a_i)}{\sum_{j=1}^{n} \exp(a_j)} $$ However, do not apply softmax manually in the ShallowNet definition, because nn.CrossEntropyLoss() already applies softmax internally (it combines LogSoftmax and NLLLoss). Applying softmax twice would result in incorrect behavior and training.
Exercise 9 [10/10]:
- Define model as a ShallowNet with 8 hidden units.
- Specify the loss criterion as nn.CrossEntropyLoss().
- Set the optimizer to SGD (with mini-batches), using a learning rate of 0.01.
- Train the model for 10 epochs, recording the train_losses and test_losses at each epoch.
- Plot the train_losses and test_losses to visualize the convergence behavior.
- Make predictions on both the training and test datasets, and report their respective accuracy.
torch.manual_seed(0) # For reproducibility
model = ShallowNet(n_x=28*28, n_h=8, n_y=n_y)
### Code Here ### (Define loss criterion)
### Code Here ###
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train_losses = []
test_losses = []
num_epochs = 10
for epoch in range(num_epochs):
train_running_loss = 0.0
test_running_loss = 0.0
# Test loop
model.eval()
with torch.no_grad():
for inputs, targets in test_loader:
### Code Here ###
### Code Here ###
test_running_loss += test_loss.item()
# Training loop
model.train()
for inputs, targets in train_loader:
optimizer.zero_grad() # Clear previous gradients
### Code Here ###
### Code Here ###
train_losses.append(train_running_loss / len(train_loader))
test_losses.append(test_running_loss / len(test_loader))
print(f"Epoch [{epoch + 1}/{num_epochs}], Train Loss: {train_losses[-1]:.4f}, Test Loss: {test_losses[-1]:.4f}")
Epoch [1/10], Train Loss: 1.3421, Test Loss: 2.3165 Epoch [2/10], Train Loss: 0.5694, Test Loss: 0.6852 Epoch [3/10], Train Loss: 0.4494, Test Loss: 0.4663 Epoch [4/10], Train Loss: 0.4003, Test Loss: 0.4021 Epoch [5/10], Train Loss: 0.3716, Test Loss: 0.3670 Epoch [6/10], Train Loss: 0.3530, Test Loss: 0.3458 Epoch [7/10], Train Loss: 0.3402, Test Loss: 0.3344 Epoch [8/10], Train Loss: 0.3308, Test Loss: 0.3231 Epoch [9/10], Train Loss: 0.3237, Test Loss: 0.3176 Epoch [10/10], Train Loss: 0.3177, Test Loss: 0.3128
# Plot the train and test losses
### Code Here ###
plot_losses(train_losses, test_losses)
### Code Here ###
### Code Here ###
_, train_accuracy = predict_and_accuracy(model, train_loader)
_, test_accuracy = predict_and_accuracy(model, test_loader)
### Code Here ###
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")
Training Accuracy: 0.9115 Testing Accuracy: 0.9132
As you can see, the train_losses and test_losses obtained with CrossEntropyLoss() are not directly comparable to those obtained with MSELoss(), since the two loss functions are on different scales; however, the accuracy is much better. Hence, the choice of hyperparameters influences the performance of DNNs.
2.2 - Tune network width and learning rate [10/10]¶
We will use random search to tune two hyperparameters: the width (number of hidden units) and the learning rate.
The possible combinations of learning rate and network width can be represented in a grid as follows:
| Learning Rate \ Width | 2 | 16 | 128 | 1024 |
|---|---|---|---|---|
| 1 | val. | val. | val. | val. |
| 0.1 | val. | val. | val. | val. |
| 0.01 | val. | val. | val. | val. |
| 0.001 | val. | val. | val. | val. |
Note: both learning rate and width are in log scale.
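As an aside, a hedged alternative to the fixed grid used below: random search often draws the learning rate log-uniformly from a continuous range, for example:
import torch

torch.manual_seed(0)
log_lrs = torch.empty(4).uniform_(-3, 0)   # exponents drawn uniformly from [-3, 0)
sampled_lrs = 10 ** log_lrs                # learning rates between 1e-3 and 1
print(sampled_lrs)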
Exercise 10 [10/10]: The first step is to split the train_dataset into train_data and val_data. Since the MNIST dataset is relatively straightforward, the results are not very sensitive to the choice of hyperparameters. Therefore, we can use a 90/10 split ratio by applying the random_split(data, [size1, size2]) function.
Next, to simplify the coding process, we will write the training loop as a function train_model() that takes model, train_loader, loss criterion, optimizer, and num_epochs as input.
In the third step, we need to define the search space. Create a range of values for both the learning rate and the network width, ensuring that both are sampled on a log scale. Then randomly sample 4 combinations of learning rate and width from the defined search space.
For each randomly selected combination of learning rate and width, train the model on the training set and validate it on the validation set. Track the performance (e.g., validation accuracy) for comparison.
from torch.utils.data import random_split
train_size = int(0.9 * len(train_dataset))
val_size = len(train_dataset) - train_size
### Code Here ### (random_split train_dataset into train_data and val_data)
### Code Here ###
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(val_data)}")
# Redefine the training and validation set loaders
batch_size = 64
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
Number of training samples: 54000 Number of validation samples: 6000
def train_model(model, train_loader, criterion, optimizer, num_epochs=10):
for epoch in range(num_epochs):
running_loss = 0.0
for inputs, targets in train_loader:
### Code Here ###
### Code Here ###
print(f"Epoch [{epoch + 1}/{num_epochs}], Train Loss: {running_loss / len(train_loader):.4f}")
learning_rates = torch.logspace(-3, 0, 5)
widths = [2, 16, 128, 1024]
print(f"Learning Rate: {learning_rates}")
print(f"Width: {widths}")
Learning Rate: tensor([0.0010, 0.0056, 0.0316, 0.1778, 1.0000]) Width: [2, 16, 128, 1024]
import random
random.seed(4) # For reproducibility
# Sample random combinations of learning rates and widths
random_combinations = [(random.choice(learning_rates), random.choice(widths)) for _ in range(4)] # 4 random combinations of (learning rate, width)
print(f"Random Combinations: {random_combinations}")
Random Combinations: [(tensor(0.0056), 128), (tensor(0.0010), 1024), (tensor(0.1778), 16), (tensor(0.0010), 2)]
best_params = None
best_val_accuracy = 0.0
for lr, width in random_combinations:
print(f"Training with Learning Rate = {lr:.6f}, Width = {width}")
torch.manual_seed(0) # For reproducibility
# Build and initialize the model with the given width
model = ShallowNet(n_x=28*28, n_h=width, n_y=n_y)
# Define the optimizer with the sampled learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# Define loss criterion
criterion = nn.CrossEntropyLoss()
# Train the model on the training set
### Code Here ###
### Code Here ###
# Validate the model on the validation set
### Code Here ###
### Code Here ###
# Track the best-performing combination
### Code Here ###
if
### Code Here ###
best_val_accuracy = val_accuracy
best_params = (lr, width)
print(f"Best parameters found: Learning Rate = {best_params[0]}, Width = {best_params[1]}")
Training with Learning Rate = 0.005623, Width = 128 Epoch [1/10], Loss: 1.6581 Epoch [2/10], Loss: 0.7577 Epoch [3/10], Loss: 0.5312 Epoch [4/10], Loss: 0.4482 Epoch [5/10], Loss: 0.4049 Epoch [6/10], Loss: 0.3779 Epoch [7/10], Loss: 0.3588 Epoch [8/10], Loss: 0.3442 Epoch [9/10], Loss: 0.3323 Epoch [10/10], Loss: 0.3223 Training with Learning Rate = 0.001000, Width = 1024 Epoch [1/10], Loss: 2.1743 Epoch [2/10], Loss: 1.8915 Epoch [3/10], Loss: 1.5737 Epoch [4/10], Loss: 1.2786 Epoch [5/10], Loss: 1.0545 Epoch [6/10], Loss: 0.8986 Epoch [7/10], Loss: 0.7903 Epoch [8/10], Loss: 0.7127 Epoch [9/10], Loss: 0.6548 Epoch [10/10], Loss: 0.6100 Training with Learning Rate = 0.177828, Width = 16 Epoch [1/10], Loss: 0.4211 Epoch [2/10], Loss: 0.2501 Epoch [3/10], Loss: 0.2155 Epoch [4/10], Loss: 0.1946 Epoch [5/10], Loss: 0.1815 Epoch [6/10], Loss: 0.1732 Epoch [7/10], Loss: 0.1668 Epoch [8/10], Loss: 0.1619 Epoch [9/10], Loss: 0.1551 Epoch [10/10], Loss: 0.1495 Training with Learning Rate = 0.001000, Width = 2 Epoch [1/10], Loss: 2.2896 Epoch [2/10], Loss: 2.1741 Epoch [3/10], Loss: 2.0789 Epoch [4/10], Loss: 1.9932 Epoch [5/10], Loss: 1.9205 Epoch [6/10], Loss: 1.8617 Epoch [7/10], Loss: 1.8131 Epoch [8/10], Loss: 1.7711 Epoch [9/10], Loss: 1.7336 Epoch [10/10], Loss: 1.6996 Best parameters found: Learning Rate = 0.17782793939113617, Width = 16 Epoch [1/10], Loss: 0.4183 Epoch [2/10], Loss: 0.2366 Epoch [3/10], Loss: 0.2000 Epoch [4/10], Loss: 0.1784 Epoch [5/10], Loss: 0.1650 Epoch [6/10], Loss: 0.1533 Epoch [7/10], Loss: 0.1476 Epoch [8/10], Loss: 0.1402 Epoch [9/10], Loss: 0.1340 Epoch [10/10], Loss: 0.1309
# Train with the best parameters
torch.manual_seed(0)
model = ShallowNet(n_x=28*28, n_h=best_params[1], n_y=n_y)
optimizer = torch.optim.SGD(model.parameters(), lr=best_params[0])
train_model(model, train_loader, criterion, optimizer)
Epoch [1/10], Loss: 0.4211 Epoch [2/10], Loss: 0.2501 Epoch [3/10], Loss: 0.2155 Epoch [4/10], Loss: 0.1946 Epoch [5/10], Loss: 0.1815 Epoch [6/10], Loss: 0.1732 Epoch [7/10], Loss: 0.1668 Epoch [8/10], Loss: 0.1619 Epoch [9/10], Loss: 0.1551 Epoch [10/10], Loss: 0.1495
_, train_accuracy = predict_and_accuracy(model,train_loader)
_, val_accuracy = predict_and_accuracy(model,val_loader)
_, test_accuracy = predict_and_accuracy(model,test_loader)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")
Training Accuracy: 0.9624 Validation Accuracy: 0.9495 Testing Accuracy: 0.9538
3 - Overparameterization¶
In the lectures, we introduced the concept of overparameterization, where models are built with many more parameters or hidden units than necessary. Here, we define a shallow network with a large number of hidden units, specifically 2048, and use a high learning rate of 1.
Because we use a much larger width, the model complexity increases, leading to lower bias and potentially good performance on the training set. However, the classical bias-variance view suggests that, due to high variance, this may not translate to the test set.
Interestingly, the following experiments demonstrate that the overparameterized model not only achieves good performance on the training set but also generalizes well to the test set. Additionally, the model can be trained successfully even with a large learning rate.
torch.manual_seed(0) # for reproducibility
width = 2048 # large width
model = ShallowNet(n_x=28*28, n_h=width, n_y=10)
learning_rate = 1.0
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
print(f"Training with Learning Rate = {learning_rate}, Width = {width}")
train_model(model, train_loader, criterion, optimizer, num_epochs=10)
Training with Learning Rate = 1.0, Width = 2048 Epoch [1/10], Loss: 0.2541 Epoch [2/10], Loss: 0.0846 Epoch [3/10], Loss: 0.0558 Epoch [4/10], Loss: 0.0360 Epoch [5/10], Loss: 0.0240 Epoch [6/10], Loss: 0.0158 Epoch [7/10], Loss: 0.0094 Epoch [8/10], Loss: 0.0066 Epoch [9/10], Loss: 0.0038 Epoch [10/10], Loss: 0.0015
_, train_accuracy = predict_and_accuracy(model,train_loader)
_, val_accuracy = predict_and_accuracy(model,val_loader)
_, test_accuracy = predict_and_accuracy(model,test_loader)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Validation Accuracy: {val_accuracy:.4f}")
print(f"Testing Accuracy: {test_accuracy:.4f}")
Training Accuracy: 1.0000 Validation Accuracy: 0.9852 Testing Accuracy: 0.9846