Homework 7: Recurrent Neural Networks¶
In this homework, you will implement a simple Recurrent Neural Network (RNN) that processes an input sequence $x=\{x_1, \cdots, x_T\}$ to generate an output sequence $y=\{y_1, \cdots, y_T\}$. The tasks include:
- Tanh and softmax
- Forward propagation
- Backpropagation through time
- RNN class
0 - Import Libraries and Tanh [1/1]¶
In this assignment, we only use numpy to implement the RNN and matplotlib for plotting.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
As discussed in the lectures, the Tanh activation function is commonly used in Natural Language Processing (NLP). Here, you'll define a class Tanh with two methods: forward() for activation and derivative() for computing gradients during backpropagation.
$$ \tanh(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $$
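The derivative() method should return the derivative of the activation, which for tanh can be written in terms of the function value itself:
$$ \tanh^{\prime}(x) = 1 - \tanh^2(x) $$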
Exercise [0.5/0.5]: Implement class Tanh.
class Tanh:
def __init__(self):
pass
def forward(self, x):
### Code Here ###
### Code Here ###
def derivative(self, x):
### Code Here ###
### Code Here ###
def __call__(self, x):
return self.forward(x)
## Test Tanh
act = Tanh()
x = np.random.randn(2,3)
print(f"act(x) =\n {act(x)}")
print(f"act.derivative(x) =\n {act.derivative(x)}")
act(x) = [[ 0.92525207 -0.5453623 -0.48398233] [-0.79057703 0.69903334 -0.98015695]] act.derivative(x) = [[0.1439086 0.70257996 0.76576111] [0.37498795 0.51135239 0.03929236]]
Additionally, we will define the Softmax activation to compute the output $\hat{y}$.
Note: The input $x$ has shape (batch_size, input_size). Your implementation should ensure that the output y = softmax(x) retains the shape (batch_size, input_size), and that the sum of y across each sample equals 1.
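For reference, here is a minimal sketch of a row-wise softmax; the name softmax_sketch and the per-row max subtraction (a common numerical-stability trick that does not change the result) are illustrative choices, not requirements of the assignment.
## Sketch: one way to write a row-wise softmax (illustrative only)
def softmax_sketch(x):
    # Subtract the per-row max for numerical stability.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exps = np.exp(shifted)
    # Each row of the result sums to 1 and keeps the shape (batch_size, input_size).
    return exps / np.sum(exps, axis=1, keepdims=True)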
Exercise [0.5/0.5]: Implement softmax.
def softmax(x):
### Code Here ###
### Code Here ###
## Test softmax
x = np.random.randn(4,5)
y = softmax(x)
print(f"softmax(x) =\n {y}")
print(f"sum of y for each sample =\n {np.sum(y, axis=1, keepdims=True)}")
softmax(x) = [[0.18376865 0.4103551 0.03269533 0.28088273 0.09229819] [0.17606771 0.11032588 0.11556621 0.08852581 0.50951439] [0.39110427 0.27540117 0.05782917 0.19338811 0.08227728] [0.10790932 0.17883443 0.21565153 0.24922814 0.24837658]] sum of y for each sample = [[1.] [1.] [1.] [1.]]
1 - Implement Recurrent Neural Network¶
1-1 Initialization [1/1]¶
Recall that we initialize the weights and biases in an MLP using random initialization. Here we do the same random initialization for the RNN. Specifically, for a weight matrix $W\in \mathbb{R}^{m\times n}$, each entry is randomly initialized using an i.i.d. Gaussian distribution: $$ W_{ij}\sim N\left(0,\frac{2}{n+m}\right) $$ It is fine to use zero initialization for the biases. Note that in an RNN, we have three weight matrices shared across all time steps: $W_x\in \mathbb{R}^{n_h\times n_x}$, $W_y\in \mathbb{R}^{n_y\times n_h}$, and $W_h\in \mathbb{R}^{n_h \times n_h}$.
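For example, a single weight matrix with this scaling can be drawn as follows; W_example, b_example, and the sizes are illustrative names and values, not part of the required interface.
## Sketch: one weight matrix with entries ~ N(0, 2/(n+m)) and a zero bias vector
m, n = 9, 3                                        # illustrative sizes (rows, columns)
W_example = np.random.randn(m, n) * np.sqrt(2.0 / (n + m))
b_example = np.zeros((m, 1))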
Exercise [2/2]: Implement initialize_rnn_parameters to randomly initialize the weights Wx, Wy, and Wh, zero-initialize the biases bh and by, and return a dictionary parameters that contains all of them.
# Initialize weights and biases
def initialize_rnn_parameters(n_x, n_h, n_y):
parameters = {}
### Code Here ###
### Code Here ###
parameters['Wx'] = Wx
parameters['Wh'] = Wh
parameters['Wy'] = Wy
parameters['bh'] = bh
parameters['by'] = by
return parameters
# Test initialize_rnn_parameters
np.random.seed(1)
n_x = 3
n_h = 9
n_y = 5
parameters = initialize_rnn_parameters(n_x, n_h, n_y)
print("Wx:", parameters['Wx'])
print("Wh:", parameters['Wh'])
print("Wy:", parameters['Wy'])
Wx: [[ 0.66313622 -0.24974851 -0.21562521] [-0.43803761 0.35330119 -0.93959924] [ 0.71231642 -0.31076142 0.13024717] [-0.10180503 0.59690307 -0.84104892] [-0.13162627 -0.15678953 0.46285944] [-0.44902873 -0.07039352 -0.3583842 ] [ 0.01723369 0.23793331 -0.4493259 ] [ 0.4673315 0.36807287 0.20514245] [ 0.3677729 -0.27913073 -0.05016972]] Wh: [[-0.31192314 -0.08929603 0.17678516 -0.23055358 -0.13225118 -0.22905757 -0.28173521 -0.22374871 -0.00422153] [-0.37243678 0.07813857 0.55326739 0.24734805 -0.06394518 -0.29587632 -0.24905276 0.56415153 0.01693592] [-0.21233188 0.06363849 0.70008505 0.04005298 0.20573437 0.10005677 -0.11741662 -0.3808394 -0.11644757] [-0.06963141 0.19554106 0.27966114 0.31036736 0.09519578 0.29504705 -0.25146598 0.41762272 0.17097661] [-0.09936428 0.16283938 -0.02519057 0.3772098 0.50660561 0.72852514 -0.46549878 -0.48137127 -0.16815529] [ 0.05334569 0.29205631 0.10521165 -0.67406707 -0.102068 0.27599155 0.07669825 0.25400373 -0.07410938] [-0.06691936 0.06218713 0.13668388 0.06609991 0.03966955 -0.2235541 0.1258546 0.04060709 0.37649464] [ 0.39963929 0.06171881 -0.12509498 -0.21291014 0.14116478 0.02578002 -0.11461789 0.01453229 -0.20666695] [ 0.23267734 -0.14904285 0.40816923 0.13449721 0.19785951 -0.36497062 0.05646081 0.24685215 -0.3179002 ]] Wy: [[-0.70434796 0.08628998 -3.63292695 0.83383337 2.23873064 -2.27406543 0.92745748 -3.47197556 -0.10237869] [-4.27493183 2.96699237 1.08184913 -0.06513034 -2.05088487 3.37004142 5.20446203 -4.91575795 3.2705826 ] [ 4.30635911 0.89429489 -3.17296497 2.28419701 -0.47867013 -1.59782379 -3.25442793 1.4565853 2.0975698 ] [-1.64970725 1.37731553 -3.02764273 2.12152488 0.12320549 -0.49361722 -0.26919428 2.29885669 1.98540258] [ 1.40083358 0.36432316 0.20589535 1.63608039 0.61512278 1.80586128 -0.82049186 -6.44197521 2.74847155]]
1-2 RNN Layer [2/2]¶
Implement the RNN layer. Recall that given a sequence $\{x_1, \dots, x_T\}$, an RNN unit takes the hidden state vector $h_{t-1}$ from the previous time step and the current input $x_t$ from the sequence, then updates the hidden state $$ h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h) $$ This updated hidden state is then used to compute the output at time $t$: $$ \hat{y}_t = \text{softmax}(W_y h_t + b_y) $$
However, to facilitate backpropagation, we will introduce extra intermediate variables whose values will be stored in the cache and used during backpropagation to compute the gradients.
$$ \begin{align} &z_t = W_h h_{t-1} + W_x x_t + b_h\\ &h_t = \phi(z_t)\\ &u_t = W_y h_t + b_y\\ &\hat{y}_t = \text{softmax}(u_t) \end{align} $$ Note that we will not store the values of $u_t$, since they are not explicitly used, thanks to the combination of the cross-entropy loss and softmax. Thus, we only need to return yhat_t, h_t, and z_t.
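As a reference, a minimal sketch of these four equations for a batch of row vectors is given below; it assumes the biases broadcast after transposing, and the helper name rnn_step_sketch is illustrative only, not necessarily the expected solution.
## Sketch: one RNN time step for a batch of row vectors (illustrative only)
def rnn_step_sketch(x_t, h_prev, Wx, Wh, Wy, bh, by, act=Tanh()):
    z_t = h_prev @ Wh.T + x_t @ Wx.T + bh.T        # pre-activation, (batch_size, hidden_size)
    h_t = act(z_t)                                 # updated hidden state
    yhat_t = softmax(h_t @ Wy.T + by.T)            # output probabilities, (batch_size, output_size)
    return yhat_t, h_t, z_t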
Exercise [2/2]: Implement a one-layer RNN rnn_layer that takes x_t, h_prev, and parameters as inputs. Return yhat_t, h_t, and z_t.
def rnn_layer(x_t, h_prev, parameters, act=Tanh()):
# xt: input at time t, shape (batch_size, input_size)
# h_prev: previous hidden state, shape (batch_size, hidden_size)
# Wh: weights for hidden to hidden, shape (hidden_size, hidden_size)
# Wx: weights for input to hidden, shape (hidden_size, input_size)
# Wy: weights for hidden to output, shape (output_size, hidden_size)
# bh: bias for hidden state, shape (hidden_size, 1)
# by: bias for output, shape (output_size, 1)
Wh = parameters['Wh']
Wx = parameters['Wx']
Wy = parameters['Wy']
bh = parameters['bh']
by = parameters['by']
### Code Here ###
### Code Here ###
return yhat_t, h_t, z_t
# Test rnn_layer()
np.random.seed(1)
batch_size = 2
x = np.random.randn(batch_size, n_x)
pre_h = np.random.randn(batch_size, n_h)
y_t, h_t, z_t = rnn_layer(x, pre_h, parameters)
print("y_t.shape:", y_t.shape)
print("y_t:", y_t)
y_t.shape: (2, 5) y_t: [[2.31497458e-04 9.24265539e-03 9.75012681e-01 1.44409055e-02 1.07226113e-03] [1.08739880e-05 9.99969433e-01 1.37498024e-08 4.22887287e-06 1.54503763e-05]]
1-3 RNN Forward Pass [2/2]¶
Now we are ready to build the forward() pass for the RNN. Given inputs, we stack the rnn_layer over sequence_length time steps. All the intermediate values h_t, yhat_t, z_t, and x_t are stored in the cache. At the end, we return the output yhats and the cache.
Exercise [2/2]: Implement forward() by stacking the rnn_layer, and return the outputs yhats and a cache that stores hs, zs, yhats, and inputs.
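For instance, the loop body could call rnn_layer on the time slice inputs[:, t, :] and record the results. A sketch of this bookkeeping is shown below as comments, since it belongs inside forward(); the exact statements are an assumption, not a required structure.
## Sketch of the loop body and cache construction inside forward() (illustrative only):
#     yhat_t, h_t, z_t = rnn_layer(inputs[:, t, :], h_prev, parameters, act)
#     yhats[:, t, :], hs[:, t, :], zs[:, t, :] = yhat_t, h_t, z_t
#     h_prev = h_t
## After the loop, the cache can collect the arrays by name:
#     cache = {'yhats': yhats, 'hs': hs, 'zs': zs, 'inputs': inputs}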
def forward(inputs, parameters, act=Tanh()):
# inputs: (batch_size, sequence_length, input_size)
# parameters: dictionary, containing (Wh, Wx, Wy, bh, by)
batch_size, sequence_length, input_size = inputs.shape
hidden_size = parameters['Wh'].shape[0]
output_size = parameters['Wy'].shape[0]
# Initialize outputs, hidden states, pre activation
yhats = np.zeros((batch_size, sequence_length, output_size))
hs = np.zeros((batch_size, sequence_length, hidden_size))
zs = np.zeros((batch_size, sequence_length, hidden_size))
# Initialize the previous hidden state
h_prev = np.zeros((batch_size, hidden_size))
# Loop through each time step
for t in range(sequence_length):
# Extract the input at time t and perform RNN layer calculations
### Code Here ###
### Code Here ###
# Store the results and Update the previous hidden state
### Code Here ###
### Code Here ###
# Cache the intermediate values
### Code Here ###
### Code Here ###
return yhats, cache
## Test
np.random.seed(1)
batch_size = 7
sequence_length = 2
n_x = 3
n_h = 9
n_y = 5
x = np.random.randn(batch_size, sequence_length, n_x)
parameters = initialize_rnn_parameters(n_x, n_h, n_y)
outputs, cache = forward(x, parameters)
print("y_pred.shape:", outputs.shape)
print("y_pred:", outputs)
y_pred.shape: (7, 2, 5) y_pred: [[[8.14030433e-01 1.85697573e-01 1.19428504e-04 1.50260441e-04 2.30558817e-06] [9.15500527e-01 5.76525372e-04 8.38080431e-02 7.74767803e-05 3.74277242e-05]] [[2.25965257e-01 7.70775868e-01 1.28454261e-03 1.82887200e-03 1.45460715e-04] [7.46652214e-01 9.38900248e-03 2.41129282e-01 2.58439814e-03 2.45103162e-04]] [[6.74700721e-04 2.42456728e-02 2.17135742e-02 1.73002039e-01 7.80364014e-01] [8.76059591e-01 1.12555365e-01 9.12727590e-03 1.98127191e-03 2.76496978e-04]] [[7.75369988e-01 1.82259822e-02 1.86318767e-01 1.21015190e-02 7.98374302e-03] [2.08883982e-05 9.38296085e-05 2.43152289e-02 2.77351744e-02 9.47834879e-01]] [[6.52061336e-01 3.45708061e-01 5.87547698e-04 1.58770786e-03 5.53476937e-05] [4.03010060e-02 1.03779191e-02 7.59748679e-01 8.62274703e-02 1.03344925e-01]] [[9.56819502e-01 3.64582631e-02 3.55765118e-03 2.85283239e-03 3.11751088e-04] [7.69509091e-01 1.84469051e-02 2.83176522e-02 9.55487248e-02 8.81776273e-02]] [[2.23020703e-07 6.61925876e-06 9.91300502e-03 4.96653305e-03 9.85113620e-01] [1.52567204e-01 8.46430959e-01 8.59757875e-05 8.82382096e-04 3.34785553e-05]]]
1-4 Compute Cost [1/1]¶
Now let us compute the cost using the cross-entropy loss. Note that we have a target sequence $y=(y_1, \dots, y_T)$ and an output sequence $\bar{y}=(\bar{y}_1, \dots, \bar{y}_T)$, where each $y_t$ and $\bar{y}_t$ is an $n_y\times 1$ vector. Their cross-entropy loss at time $t$ is $$ \ell(y_t, \bar{y}_t) = -y_t \cdot \log \bar{y}_t $$ Then the cost across all time steps is $$ \ell(y, \bar{y}) = \frac{1}{T}\sum_{t} \ell(y_t, \bar{y}_t) $$ Considering all training samples, we have $$ L = \frac{1}{N}\sum_{i=1}^{N} \ell(y^{(i)}, \bar{y}^{(i)}) = \frac{1}{NT}\sum_{i=1}^{N}\sum_{t=1}^{T} \ell(y_t^{(i)}, \bar{y}_t^{(i)}) =-\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T} y_t^{(i)} \cdot \log \bar{y}_t^{(i)} $$
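A direct translation of the last formula into numpy might look like the sketch below; the name compute_cost_sketch and the small eps added inside the log (to guard against log(0)) are illustrative assumptions.
## Sketch: average cross-entropy over samples and time steps (illustrative only)
def compute_cost_sketch(targets, outputs, eps=1e-12):
    batch_size, sequence_length, _ = targets.shape
    # L = -1/(N*T) * sum over samples, time steps, and classes of y * log(yhat)
    return -np.sum(targets * np.log(outputs + eps)) / (batch_size * sequence_length)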
Exercise [1/1]: Implement compute_cost().
def compute_cost(targets, outputs):
# Compute the cross-entropy loss for the predicted outputs
# targets: (batch_size, sequence_length, output_size)
# outputs: (batch_size, sequence_length, output_size)
batch_size, sequence_length, output_size = targets.shape
# Compute the cross-entropy loss
### Code Here ###
### Code Here ###
return loss
def generate_one_hot_targets(batch_size, sequence_length, n_y):
random_class_indices = np.random.randint(0, n_y, size=(batch_size, sequence_length))
one_hot_targets = np.zeros((batch_size, sequence_length, n_y))
one_hot_targets[np.arange(batch_size)[:, None], np.arange(sequence_length), random_class_indices] = 1
return one_hot_targets
np.random.seed(123)
targets = generate_one_hot_targets(batch_size, sequence_length, n_y)
loss = compute_cost(targets, outputs)
print("loss:", loss)
loss: 5.676070798625509
1-5 Backpropagation [3/3]¶
Recall that we have forward $$ \begin{align} z_t = &W_h h_{t-1} + W_x x_t + b_h\\ h_t = &\phi(z_t)\\ u_t = & W_y h_t + b_y\\ \bar{y}_t = & \text{softmax}(u_t) \end{align} $$
Using the chain rule, we have $$ \begin{align} &du_t= \bar{y}_t-y_t\\ &dW_y = d u_t h_t^{\top}\\ &db_y = d u_t\\ &dh_t = W_y^{\top} d u_t + W_h^{\top} dz_{t+1} \\ &dz_t = \phi^{\prime}(z_t) \odot dh_t\\ &dW_h = dz_t h_{t-1}^{\top}\\ &dW_x = dz_t x_t^{\top}\\ &db_h = dz_t \end{align} $$ We can see that we need to pass $dz_t$ back to the previous time step as we backpropagate through time.
To use vectorization, suppose x_t is a batch of inputs with shape (batch_size, input_size). Then the forward pass becomes $$ \begin{align} &Z_t = H_{t-1}W_h^{\top} + X_t W_x^{\top} + e b_h^{\top} \\ &H_t = \phi(Z_t)\\ &U_t = H_t W_y^{\top} + e b_y^{\top}\\ &\bar{Y}_t = \text{softmax}(U_t) \end{align} $$ where $$ \begin{align} Z_t = \begin{bmatrix} z_t^{(1)} & z_t^{(2)} & \cdots & z_t^{(N)} \end{bmatrix}^{\top} \in \mathbb{R}^{N\times n_h}, \quad e = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}^{\top} \in \mathbb{R}^{N\times 1} \end{align} $$ Then the gradients become $$ \begin{align} &dU_t =\frac{1}{NT} (\bar{Y}_t - Y_t)\\ &dW_y = \sum_t dU_t^{\top} H_t\\ &db_y = \sum_t dU_t^{\top} e\\ &dH_t = dU_t W_y+ dZ_{t+1} W_h\\ &dZ_t = \phi^{\prime}(Z_t) \odot dH_t\\ &dW_h = \sum_t (dZ_t)^{\top} H_{t-1}\\ &dW_x = \sum_t (dZ_t)^{\top} X_t\\ &db_h = \sum_t (dZ_t)^{\top} e \end{align} $$ Note: because the weights and biases are shared across all time steps, the gradients are accumulated across all time steps.
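Inside the reversed time loop, these vectorized updates translate almost line for line into numpy. One possible translation is sketched below as comments, since it belongs inside the loop of backward(); the variable names mirror the skeleton that follows and are illustrative only.
## Sketch: per-step gradient updates inside the reversed loop (illustrative only):
#     dU_t = (yhat_t - y_t) / (batch_size * sequence_length)
#     dWy += dU_t.T @ h_t
#     dby += np.sum(dU_t, axis=0, keepdims=True).T
#     dH_t = dU_t @ Wy + dz_t_next @ Wh
#     dZ_t = act_derivative(z_t) * dH_t
#     dWh += dZ_t.T @ h_prev
#     dWx += dZ_t.T @ x_t
#     dbh += np.sum(dZ_t, axis=0, keepdims=True).T
#     dz_t_next = dZ_t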
Exercise [3/3]: Implement the backward() function and return grads.
def backward(y, parameters, caches, act_derivative=Tanh().derivative):
# y: (batch_size, sequence_length, output_size)
# parameters: dictionary, containing (Wh, Wx, Wy, bh, by)
# caches: dictionary, containing (yhats, hs, zs, inputs)
batch_size, sequence_length, output_size = y.shape
hidden_size = parameters['Wh'].shape[0]
input_size = parameters['Wx'].shape[1]
Wy = parameters['Wy']
Wh = parameters['Wh']
Wx = parameters['Wx']
bh = parameters['bh']
by = parameters['by']
# Initialize gradients
dWh = np.zeros_like(parameters['Wh'])
dWx = np.zeros_like(parameters['Wx'])
dWy = np.zeros_like(parameters['Wy'])
dbh = np.zeros_like(parameters['bh'])
dby = np.zeros_like(parameters['by'])
grads ={}
dz_t_next = np.zeros((batch_size, hidden_size))
for t in reversed(range(sequence_length)):
# Retrieve cached values for current time step
### Code Here ###
### Code Here ###
# Handle previous hidden state (use zeros if t=0)
### Code Here ###
### Code Here ###
# Compute output error
### Code Here ###
### Code Here ###
# Compute gradients for output weights and biases
### Code Here ###
### Code Here ###
# Backpropagate error to hidden state
### Code Here ###
### Code Here ###
# Compute gradient w.r.t. pre-activation z_t
### Code Here ###
### Code Here ###
# Compute gradients for hidden-to-hidden weights, input-to-hidden weights, and biases
### Code Here ###
### Code Here ###
# Update dz_t_next
### Code Here ###
### Code Here ###
grads['dWh'] = dWh
grads['dWx'] = dWx
grads['dWy'] = dWy
grads['dbh'] = dbh
grads['dby'] = dby
return grads
## Test
grads = backward(targets, parameters, cache)
print("grads['dWx'].shape:", grads['dWx'].shape)
print("grads['dWx']:", grads['dWx'])
grads['dWx'].shape: (9, 3) grads['dWx']: [[-0.59072159 0.40433816 1.10746108] [ 0.958936 0.05239214 1.00942008] [-0.5482359 0.32950661 -1.2185552 ] [ 0.28912497 0.23787591 -3.48232618] [-1.60721343 0.91352167 0.52073181] [ 0.72894355 -0.49311288 0.55852002] [-0.04680902 -0.37165716 2.42702709] [ 1.07534461 -0.60867996 1.19172103] [-0.96090388 0.4181982 0.33683314]]
1-6 Build RNN [1/1]¶
Now that we have all the components, it is time to build the RecurrentNeuralNetwork class. It has three essential methods:
- initialize_parameters(): randomly initialize each weight matrix and zero-initialize the biases.
- forward(): given inputs, use self.parameters to compute the outputs and return them; meanwhile, store the intermediate values in self.caches.
- backward(): employ backpropagation through time to compute the gradients for the weights and biases.
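One natural way to wire the class is to delegate to the module-level functions defined earlier in this notebook. A minimal sketch is shown below as comments; attribute names such as self.input_size, self.hidden_size, self.output_size, and self.act are assumptions about how __init__ stores its arguments, not a required structure.
## Sketch: delegating the class methods to the functions defined above (illustrative only):
#     def initialize_parameters(self):
#         self.parameters = initialize_rnn_parameters(self.input_size, self.hidden_size, self.output_size)
#     def forward(self, inputs):
#         outputs, self.caches = forward(inputs, self.parameters, self.act)
#         return outputs
#     def backward(self, y):
#         self.grads = backward(y, self.parameters, self.caches, self.act.derivative)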
Exercise [1/1]: Implement the RecurrentNeuralNetwork class.
class RecurrentNeuralNetwork:
def __init__(self, input_size, output_size, hidden_size, act=Tanh()):
# Save input parameters to attributes and initialize parameters
### Code Here ###
### Code Here ###
def initialize_parameters(self):
self.parameters = {}
### Code Here ###
### Code Here ###
def forward(self, inputs):
self.caches = {}
### Code Here ###
### Code Here ###
return outputs
def backward(self, y):
self.grads = {}
### Code Here ###
### Code Here ###
self.grads = grads
def __call__(self, inputs):
return self.forward(inputs)
# Test RecurrentNeuralNetwork and its forward()
np.random.seed(1)
n_x = 3
n_h = 9
n_y = 5
batch_size = 4
sequence_length = 2
x = np.random.randn(batch_size, sequence_length, n_x)
y = generate_one_hot_targets(batch_size, sequence_length, n_y)
network = RecurrentNeuralNetwork(n_x, n_y, n_h)
outputs = network(x)
print(f"Shape of outputs: {outputs.shape}")
# print(f"outputs = \n{outputs}")
Shape of outputs: (4, 2, 5)
# Test the backward() method in RecurrentNeuralNetwork
network.backward(y)
print(f"Shape of grads: {network.grads['dWx'].shape}")
print(f"grads['dWx'] = \n{network.grads['dWx']}")
Shape of grads: (9, 3) grads['dWx'] = [[-0.33408226 0.20717265 -0.17950987] [ 0.42339331 0.24859148 0.09659816] [ 0.85870903 -0.41918044 0.00496727] [ 0.45031315 -0.47297622 0.71640565] [ 0.52565742 -0.23802471 0.01545178] [ 0.06318471 -0.34500968 0.19753802] [-0.01849903 0.19205179 0.17765888] [ 0.23638977 -0.05530361 -0.19717733] [ 0.67169325 -0.48560953 0.41511517]]
2 - Training RNN using Gradient Descent¶
We can now train the RNN using gradient descent.
def gradient_descent_step(parameters, grads, learning_rate):
for key in parameters.keys():
parameters[key] -= learning_rate * grads[f'd{key}']
gradient_descent_step(network.parameters, network.grads, learning_rate=0.01)
for key in network.parameters:
print(f"{key} = \n{network.parameters[f'{key}']}")
Wx = [[-0.01985759 -0.32381429 0.0294375 ] [-0.4283655 -0.11013178 0.21877389] [ 0.61532254 0.80300178 0.1951748 ] [ 0.04421935 -0.34532894 -0.31822794] [ 0.25573856 0.02178268 1.271471 ] [-0.21800838 0.14453404 -0.17569076] [-0.9137896 0.71471729 -0.10963308] [ 0.34313437 0.85245801 0.21690564] [ 0.39492141 -0.77987877 -0.36946202]] Wh = [[ 0.10986031 0.3496242 0.44979252 0.41524094 -0.23131324 0.42734633 -0.24182157 -0.26638211 -0.00153421] [ 0.17096413 -0.0548569 0.44763083 -0.20553441 -0.24781734 0.33887128 -0.18461997 0.14480114 -0.04988595] [ 0.00799085 -0.08548738 0.6127309 -0.06509604 0.52612364 0.25211712 -0.17554081 -0.25126458 0.02946636] [ 0.56049949 -0.06302826 -0.05351696 -0.45793266 0.01562913 -0.52317622 -0.43101348 0.02469749 -0.29477515] [-0.07760832 -0.31835933 0.00785813 -0.32343905 -0.25461273 -0.08842911 -0.05770037 -0.62784493 0.5360078 ] [ 0.38216656 -0.16221317 0.34092179 -0.32828395 -0.41293664 0.64027094 -0.14951357 0.51540084 -0.63782513] [-0.331346 -0.13554117 0.03370381 -0.28909834 0.12165935 -0.31296643 -0.4624453 -0.11332773 -0.05977637] [-0.08033296 -0.08249913 -0.05623839 0.19120769 -0.31898973 -0.21712552 0.23985408 0.41145335 -0.08891493] [ 0.52267962 -0.32835657 0.02762499 0.00970381 -0.38607831 -0.36657855 -0.0409012 -0.24448924 0.2266115 ]] Wy = [[-1.1509867 -2.70785963 0.33763363 1.89973091 1.02847637 2.83459239 3.11818766 -2.39918769 4.53899388] [ 0.4360091 1.33122474 -2.38190165 4.3178705 -1.3831266 -1.51272681 3.78791469 -2.23010359 -1.37526698] [-0.11224725 -1.1086377 -1.31274024 3.07604818 -1.09776538 0.25854946 -1.69262062 -0.58312581 1.4589016 ] [-0.12766189 -0.91789953 -5.44624739 0.02015283 -2.57381152 0.23709946 2.66270509 -2.35274898 -3.53495414] [-0.67905862 1.04181555 -5.01694331 -2.28537271 -2.69671482 0.55621664 -0.32703492 0.7893244 1.52645133]] bh = [[ 0.00441775] [ 0.00093143] [-0.00193035] [-0.00258762] [-0.00288833] [-0.00309124] [-0.00920422] [ 0.00148524] [ 0.00653478]] by = [[-0.00085045] [-0.00016961] [ 0.00106684] [-0.00045858] [ 0.0004118 ]]
def train(network, inputs, targets, learning_rate=0.01, max_iter=100):
losses = []
for i in range(max_iter):
outputs = network(inputs)
        losses.append(compute_cost(targets, outputs))
        network.backward(targets)
gradient_descent_step(network.parameters, network.grads, learning_rate)
if i % 100 == 0:
print(f"Iteration {i}: Loss = {losses[-1]}")
return losses
np.random.seed(1)
n_x = 3
n_h = 100
n_y = 5
batch_size = 10
sequence_length = 50
x = np.random.randn(batch_size, sequence_length, n_x)
y = generate_one_hot_targets(batch_size, sequence_length, n_y)
network = RecurrentNeuralNetwork(n_x, n_y, n_h)
losses = train(network, x, y, learning_rate=0.01, max_iter=1000)
plt.plot(losses)
plt.show()
Iteration 0: Loss = 18.15663669722336 Iteration 100: Loss = 0.8782191175534547 Iteration 200: Loss = 0.5335292699306723 Iteration 300: Loss = 0.4342417973404307 Iteration 400: Loss = 0.37848605036261607 Iteration 500: Loss = 0.3405588773870283 Iteration 600: Loss = 0.3119290213417056 Iteration 700: Loss = 0.2888488852341271 Iteration 800: Loss = 0.26941228378657167 Iteration 900: Loss = 0.25255578204420387