Optimization is an important part of deep learning and has attracted much attention from researchers, especially with the exponential growth in the amount of data. Neural networks consist of millions of parameters, and handling this complexity is a challenge: optimization algorithms have to be efficient to achieve good results. The job of an optimization algorithm is to minimize the loss function, ideally by reaching the global minimum.

Two important metrics determine the efficiency of an optimizer: the speed of convergence, i.e. how quickly it reaches the global minimum, and generalization, i.e. how well the model performs on unseen data. Optimization algorithms are designed and compared with these two metrics in mind. Throughout this article, we will discuss adaptive optimization techniques along with their intuition and implementation.

Topics we cover in this article

  • Understanding adaptive optimization 
  • Adagrad optimization
  • Adadelta optimization
  • Adam optimization
  • Adabound optimization

Understanding Adaptive optimization

Optimization techniques like Gradient Descent, SGD, and mini-batch Gradient Descent require the learning rate hyperparameter to be set before training the model. If that learning rate does not give good results, we have to change it and train the model again, and in deep learning training generally takes a lot of time. To avoid this repeated tuning, researchers came up with adaptive optimization techniques. Here we do not need to hand-tune the learning rate; we only initialize it (commonly to 0.001), and the adaptive algorithm keeps adjusting the effective learning rate while training the model.

So what is the learning rate? The learning rate is one of the most important aspects of the learning process: it controls the size of the steps the model takes towards the global minimum.
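As a rough sketch of what such a step looks like (a minimal NumPy example, not part of the CNN code below; the function name and the toy loss are just for illustration), one plain gradient descent update scales the gradient by the learning rate before subtracting it from the weights:

import numpy as np

def gradient_descent_step(weights, gradient, learning_rate=0.01):
    # One plain (non-adaptive) update: the learning rate fixes the step size
    return weights - learning_rate * gradient

# Toy example: minimizing f(w) = w^2, whose gradient is 2w
w = np.array([5.0])
for _ in range(3):
    w = gradient_descent_step(w, 2 * w, learning_rate=0.1)
print(w)  # the weight moves towards the minimum at w = 0

Adaptive optimizers keep this same structure but replace the fixed learning rate with a per-parameter value computed from the gradient history.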

Hands-on implementation

Here we will implement a Convolutional Neural Network (CNN) for MNIST digit classification, which we will use to compare the optimization techniques.

import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

# Model configuration
batch_size = 250
no_epochs = 5
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape 

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

# Reshape the training data to include channels
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)

input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))

Now, we will discuss the adaptive optimization techniques one by one and plug each of them into the CNN model defined above.

Adagrad 

Adagrad adapts the learning rate for each parameter by dividing the base learning rate by the square root of the cumulative sum of squared gradients, i.e. the current squared gradient added to all previous ones.

θ_{t+1} = θ_t − η · g_t / √(G_t + ε)

Here θ is the parameter we need to update, η is the learning rate, ε is a small constant added to avoid division by zero, G_t is the accumulated sum of squared gradients up to time step t, and g_t is the gradient at time t.
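To make the rule concrete, here is a minimal NumPy sketch of a single Adagrad step under the notation above (the function and variable names are illustrative, not Keras internals):

import numpy as np

def adagrad_step(weights, gradient, accumulator, learning_rate=0.001, epsilon=1e-07):
    # Accumulate the sum of squared gradients: G_t = G_{t-1} + g_t^2
    accumulator = accumulator + gradient ** 2
    # Scale each parameter's step by the square root of its accumulated history
    step = learning_rate * gradient / (np.sqrt(accumulator) + epsilon)
    return weights - step, accumulator

Because the accumulator only grows, the effective learning rate keeps shrinking over training, which is Adagrad's main weakness and the motivation behind Adadelta.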

Compiling the CNN model with the Adagrad optimizer

# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adagrad(
                  learning_rate=0.001,
                  initial_accumulator_value=0.1,
                  epsilon=1e-07),
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
            batch_size=batch_size,
            epochs=no_epochs,
            verbose=verbosity,
            validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss using Adagrad: {score[0]} / Test accuracy: {score[1]}')

Output: training log and test accuracy with the Adagrad optimizer.

Adadelta

Adadelta works with exponential moving averages of the squared gradients and of the squared deltas, where a delta is the difference between the current weight and the newly updated weight. In its original formulation, Adadelta removes the learning rate hyperparameter entirely and replaces it with this delta-based term.
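A minimal NumPy sketch of one Adadelta step, assuming the usual formulation with decaying averages of squared gradients and squared deltas (names are illustrative):

import numpy as np

def adadelta_step(weights, gradient, avg_sq_grad, avg_sq_delta, rho=0.95, epsilon=1e-07):
    # Decaying average of squared gradients
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * gradient ** 2
    # The ratio of the two running averages plays the role of the learning rate
    delta = -np.sqrt(avg_sq_delta + epsilon) / np.sqrt(avg_sq_grad + epsilon) * gradient
    # Decaying average of squared parameter updates (deltas)
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta ** 2
    return weights + delta, avg_sq_grad, avg_sq_delta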



Compiling the CNN model with the Adadelta optimizer (replacing Adagrad in the CNN model above)

model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adadelta(
                  learning_rate=0.001, rho=0.95, epsilon=1e-07),
              metrics=['accuracy'])

After updating the optimizer to Adadelta, we again trained the model.

Output: training log and test accuracy with the Adadelta optimizer.

Adam – Adaptive moment estimation 

Adam is very popular, is often the first optimizer beginners reach for, and is used in many models. It is essentially a combination of RMSprop and momentum: like RMSprop, it uses the squared gradients to scale the learning rate, and like momentum, it uses a moving average of the gradients themselves. It computes individual adaptive learning rates for different parameters.

Momentum

In the momentum technique, instead of using only the gradient of the current step, we also accumulate the gradients of past steps while moving towards the global minimum. SGD is usually used with momentum, which helps it accelerate training and dampen oscillations.


A typical value for the momentum term is γ = 0.9.
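If you want to try momentum on its own, Keras exposes it through the standard SGD optimizer; a drop-in compile call for the CNN above could look like the following (the learning rate of 0.01 is an arbitrary choice for illustration):

model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics=['accuracy'])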
Returning to Adam, its update combines both ideas:

m_t = β_1 · m_{t−1} + (1 − β_1) · g_t
v_t = β_2 · v_{t−1} + (1 − β_2) · g_t²
θ_{t+1} = θ_t − η · m_t / (√(v_t) + ε)

Here m and v are exponential moving averages of the gradient and of the squared gradient respectively, the default decay rates in Adam are β_1 = 0.9 and β_2 = 0.999, and g is the gradient computed on the mini-batch.
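Putting the two ideas together, a minimal NumPy sketch of one Adam step might look like the following (illustrative names; it also includes the bias correction that the simplified formulas above leave out):

import numpy as np

def adam_step(weights, gradient, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-07):
    # Momentum-style moving average of the gradient (first moment)
    m = beta_1 * m + (1 - beta_1) * gradient
    # RMSprop-style moving average of the squared gradient (second moment)
    v = beta_2 * v + (1 - beta_2) * gradient ** 2
    # Bias correction for the early steps (t starts at 1)
    m_hat = m / (1 - beta_1 ** t)
    v_hat = v / (1 - beta_2 ** t)
    weights = weights - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return weights, m, v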

Compiling the CNN model with the Adam optimizer (replacing the optimizer in the CNN model above)

model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['accuracy'])

After updating the optimizer to Adam, we again trained the model.

Output: training log and test accuracy with the Adam optimizer.

Adabound 

AdaBound is an Adam variant that uses dynamic bounds on the learning rate, designed to be as fast as Adam and as good as SGD. The main problem with adaptive techniques is that they can fail to converge well because of unstable and extreme learning rates; AdaBound therefore clips the learning rate between a lower and an upper bound, initialized as 0 and infinity respectively, which both gradually converge to a constant final learning rate.

This idea was inspired by gradient clipping, where gradients larger than a threshold are clipped to avoid exploding gradients.

η̂_t = Clip(η / √(v_t), η_l(t), η_u(t)),   θ_{t+1} = θ_t − η̂_t · m_t

where η_l(t) and η_u(t) are the dynamic lower and upper bounds on the learning rate.
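As a rough NumPy sketch of the bounding idea (the bound schedules below follow the reference AdaBound implementation with its default gamma, but treat them as an illustrative assumption rather than the exact Keras port):

import numpy as np

def adabound_step(weights, m_hat, v_hat, t, learning_rate=1e-3,
                  final_lr=0.1, gamma=1e-3, epsilon=1e-08):
    # Lower bound rises from 0 towards final_lr; upper bound falls from a very
    # large value towards final_lr (t starts at 1)
    lower = final_lr * (1 - 1 / (gamma * t + 1))
    upper = final_lr * (1 + 1 / (gamma * t))
    # Clip the per-parameter Adam step size into the dynamic bounds
    step_size = np.clip(learning_rate / (np.sqrt(v_hat) + epsilon), lower, upper)
    return weights - step_size * m_hat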

Compiling the CNN model with the AdaBound optimizer (replacing the optimizer in the CNN model above)

In the code snippet below, we import AdaBound from the keras-adabound package (installable with pip install keras-adabound), because AdaBound is not built into the Keras optimizers library.

from keras_adabound import AdaBound
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=AdaBound(lr=1e-3, final_lr=0.1),
              metrics=['accuracy'])

After updating the optimizer to AdaBound, we again trained the model.

Output: training log and test accuracy with the AdaBound optimizer.

Conclusion

In this article, we discussed adaptive optimization techniques and demonstrated their implementation. As discussed above, the best optimization algorithm should offer both fast convergence and good generalization to new data. In our experiments, the AdaBound optimizer achieved higher accuracy than the other optimizers, balancing convergence and generalization.
