4.3. Optimization

While the sample code so far has used the gradient descent algorithm to update the weights, more efficient options exist, as discussed in Appendix 2.3.

This section introduces one of them, an algorithm called Adam, and compares it with gradient descent on the XOR gate problem.

Reference

Theoretical background of Adam:

4.3.1. Implementation

Complete Python code is available at: Optimizer.py

4.3.1.1. Adam Class

Below is the Adam class:

import numpy as np

#
# ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION
# https://arxiv.org/pdf/1412.6980v8.pdf
#
class Adam:

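    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr  # Learning rate
        self.beta1 = beta1  # Exponential decay rate for m
        self.beta2 = beta2  # Exponential decay rate for v
        self.iter = 0  # Number of updates performed so far
        self.m = None  # First-moment estimates (moving average of gradients)
        self.v = None  # Second-moment estimates (moving average of squared gradients)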

    def update(self, params, grads):

        if self.m is None:
            self.m = []
            self.v = []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))

        self.iter += 1
        # Step size with the bias corrections for m and v folded in
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
        for i in range(len(params)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grads[i]
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (grads[i] ** 2)
            # In-place update; 1e-7 guards against division by zero
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
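
As a quick sanity check, the following sketch runs the optimizer on a toy problem that is not part of the original code: minimizing f(w) = (w - 3)^2. The class is repeated in compact form only so that the block runs on its own; the toy function, the starting point, and the step count are assumptions chosen for illustration.

```python
import numpy as np


class Adam:
    """Compact copy of the Adam class above, repeated so this block runs alone."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.iter = 0
        self.m = None  # first-moment estimates
        self.v = None  # second-moment estimates

    def update(self, params, grads):
        if self.m is None:
            self.m = [np.zeros_like(p) for p in params]
            self.v = [np.zeros_like(p) for p in params]
        self.iter += 1
        lr_t = (self.lr * np.sqrt(1.0 - self.beta2 ** self.iter)
                / (1.0 - self.beta1 ** self.iter))
        for i in range(len(params)):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grads[i]
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * grads[i] ** 2
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)


# Hypothetical toy problem: minimize f(w) = (w - 3)^2, with gradient 2 * (w - 3).
w = np.array([0.0])
optimizer = Adam(lr=0.1)

# First update: params are modified in place, so w itself changes.
optimizer.update([w], [2.0 * (w - 3.0)])
first_step = w.copy()

for _ in range(499):
    optimizer.update([w], [2.0 * (w - 3.0)])

print("after 1 step:", first_step, "after 500 steps:", w)
```

On the first update the two bias corrections cancel the gradient's scale, so the parameter moves by roughly lr (here about 0.1) regardless of how large the gradient is. After that, w should settle close to the minimum at 3.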

4.3.1.2. update_weights() function

The update_weights() function is a wrapper for weight updates that uses either Adam’s update function or plain gradient descent. It selects the optimization method based on whether the optimizer parameter is given:

  • If the optimizer parameter is specified, the function calls Adam’s update function to adjust the weights.
  • If the optimizer parameter is absent (i.e., optimizer is None), the function defaults to the gradient descent algorithm for weight updates.

def update_weights(layers, lr=0.01, optimizer=None, max_norm=None):

    grads = []
    params = []
    for layer in layers:
        grads.extend(layer.get_grads())
        params.extend(layer.get_params())

    # Clip gradient
    if max_norm is not None:
        grads = clip_gradient_norm(grads, max_norm)

    # Weights and Bias Update
    if optimizer is None:
        # gradient descent
        for i in range(len(grads)):
            params[i] -= lr * grads[i]
    else:
        # ADAM
        optimizer.update(params, grads)
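
update_weights() relies on two things that this section does not show: layers that expose get_params() and get_grads(), and the clip_gradient_norm() helper. The sketch below fills both in with hypothetical stand-ins (a TinyLayer class, and a global L2-norm clipping rule that may differ from the author's version). It is meant to illustrate why get_params() has to return the arrays themselves rather than copies: otherwise the in-place update would never reach the layer.

```python
import numpy as np


class TinyLayer:
    """Hypothetical stand-in for a layer; not part of the original code."""

    def __init__(self, W, b):
        self.W, self.b = W, b
        self.dW, self.db = np.zeros_like(W), np.zeros_like(b)

    def get_params(self):
        # Return the arrays themselves (not copies) so that the in-place
        # "params[i] -= ..." in update_weights() changes the layer.
        return [self.W, self.b]

    def get_grads(self):
        return [self.dW, self.db]


def clip_gradient_norm(grads, max_norm):
    """Assumed behaviour: rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def update_weights(layers, lr=0.01, optimizer=None, max_norm=None):
    # Same logic as the function above, repeated so the block runs alone.
    grads, params = [], []
    for layer in layers:
        grads.extend(layer.get_grads())
        params.extend(layer.get_params())
    if max_norm is not None:
        grads = clip_gradient_norm(grads, max_norm)
    if optimizer is None:
        for i in range(len(grads)):
            params[i] -= lr * grads[i]
    else:
        optimizer.update(params, grads)


layer = TinyLayer(np.ones((2, 2)), np.zeros(2))
layer.dW = np.array([[3.0, 0.0], [0.0, 0.0]])
layer.db = np.array([4.0, 0.0])  # global gradient norm: sqrt(3^2 + 4^2) = 5

update_weights([layer], lr=0.5, max_norm=1.0)  # gradient descent branch
print(layer.W, layer.b)
```

With a gradient norm of 5 and max_norm=1, the gradients are scaled by 0.2 before the gradient descent step, so W[0, 0] moves from 1.0 to 0.7 and b[0] from 0.0 to -0.4, while entries with zero gradient stay unchanged.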

4.3.2. Demonstration

To demonstrate Adam’s effectiveness compared to gradient descent, I wrote a Python script named XOR-gate-adam.py. It trains a network on the XOR gate with both algorithms, allowing a direct performance comparison.

Here is the part of the script that is relevant to Adam:

dense = Layers.Dense(input_nodes, hidden_nodes, sigmoid, deriv_sigmoid)
dense_1 = Layers.Dense(hidden_nodes, output_nodes, sigmoid, deriv_sigmoid)

# ========================================
# Training
# ========================================

def train_adam(x, Y, optimizer):

    # Forward Propagation
    y = dense.forward_prop(x)
    y = dense_1.forward_prop(y)

    # Back Propagation
    loss = np.sum((y - Y) ** 2 / 2)
    dL = (y - Y)

    dx = dense_1.back_prop(dL)
    _ = dense.back_prop(dx)

    # Weights and Bias Update
    update_weights([dense, dense_1], optimizer=optimizer)

    return loss


lr = 0.1
beta1 = 0.9
beta2 = 0.999
optimizer = Optimizer.Adam(lr=lr, beta1=beta1, beta2=beta2)

history_loss_adam = []


#
# Training loop
#
for epoch in range(1, n_epochs + 1):

    loss = 0.0

    for i in range(0, len(Y)):
        loss += train_adam(X[i], Y[i], optimizer)
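
In train_adam(), the value dL = (y - Y) passed to back propagation is the derivative of the loss np.sum((y - Y) ** 2 / 2) with respect to the output y. The following check compares it against a finite-difference estimate; the numeric values of y and Y are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical output and target values, for illustration only.
y = np.array([0.8, 0.3])
Y = np.array([1.0, 0.0])


def loss_fn(y, Y):
    # Same loss expression as in train_adam()
    return np.sum((y - Y) ** 2 / 2)


dL = y - Y  # analytic derivative, as used in train_adam()

# Central finite differences, one output component at a time.
eps = 1e-6
numeric = np.zeros_like(y)
for i in range(len(y)):
    step = np.zeros_like(y)
    step[i] = eps
    numeric[i] = (loss_fn(y + step, Y) - loss_fn(y - step, Y)) / (2 * eps)

print("analytic:", dL)
print("numeric: ", numeric)
```

The two results should agree up to rounding error, since the loss is quadratic in y.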

Run the following command to execute the Python code and compare the performance of Adam and gradient descent algorithms on the XOR gate problem:

$ python XOR-gate-adam.py
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None,  3)                    9
 dense_1 (Dense)             (None,  1)                    4
=================================================================
Total params: 13

========= Gradient Descent =========
epoch: 1 / 10000  Loss = 0.670351
epoch: 1000 / 10000  Loss = 0.473633
epoch: 2000 / 10000  Loss = 0.374803
epoch: 3000 / 10000  Loss = 0.258562
epoch: 4000 / 10000  Loss = 0.091229
epoch: 5000 / 10000  Loss = 0.031913
epoch: 6000 / 10000  Loss = 0.016630
epoch: 7000 / 10000  Loss = 0.010774
epoch: 8000 / 10000  Loss = 0.007822
epoch: 9000 / 10000  Loss = 0.006079
epoch: 10000 / 10000  Loss = 0.004941
------------------------
x0 XOR x1 => result
========================
 0 XOR  0 => 0.0556
 0 XOR  1 => 0.9531
 1 XOR  0 => 0.9532
 1 XOR  1 => 0.0489
========================

=========       ADAM       =========
epoch: 1 / 10000  Loss = 0.579207
epoch: 1000 / 10000  Loss = 0.000240
epoch: 2000 / 10000  Loss = 0.000025
epoch: 3000 / 10000  Loss = 0.000003
epoch: 4000 / 10000  Loss = 0.000000
epoch: 5000 / 10000  Loss = 0.000000
epoch: 6000 / 10000  Loss = 0.000000
epoch: 7000 / 10000  Loss = 0.000000
epoch: 8000 / 10000  Loss = 0.000000
epoch: 9000 / 10000  Loss = 0.000000
epoch: 10000 / 10000  Loss = 0.000000
------------------------
x0 XOR x1 => result
========================
 0 XOR  0 => 0.0000
 0 XOR  1 => 1.0000
 1 XOR  0 => 1.0000
 1 XOR  1 => 0.0000
========================

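The results show that Adam converges significantly faster than gradient descent in this run: Adam’s loss is already 0.000240 at epoch 1000, whereas gradient descent only reaches 0.004941 after all 10000 epochs.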

Note

Although rare, training can fail to converge depending on factors such as the initial values.