Understanding Adaptive Optimization Techniques in Deep Learning
Optimization is a central part of deep learning and, with the exponential growth in the amount of data, it has attracted a lot of attention from researchers. Neural networks contain millions of parameters to handle this complexity, which makes training a challenge: optimization algorithms have to be efficient to achieve good results. The job of an optimization algorithm is to minimize the loss function, ideally by reaching the global minimum.
Two important metrics determine the efficiency of an optimization algorithm: the speed of convergence, i.e. how quickly it approaches the global minimum, and generalization, i.e. how well the trained model performs on unseen data. Optimization algorithms are designed around these two metrics. Throughout this article, we will discuss adaptive optimization techniques along with their intuition and implementation.
Topics we cover in this article
- Understanding adaptive optimization
- Adagrad optimization
- Adadelta optimization
- Adam optimization
- Adabound optimization
Understanding Adaptive optimization
Optimization techniques like Gradient Descent, SGD and mini-batch Gradient Descent require the learning rate to be set as a hyperparameter before training. If the chosen learning rate doesn't give good results, we have to change it and train the model again, and in deep learning a single training run can take a long time. Tuning the learning rate by hand quickly becomes tedious, and this motivated adaptive optimization techniques. With these methods we only provide an initial learning rate (typically 0.001), and the algorithm keeps adapting the effective learning rate for each parameter while the model trains.
So what is the learning rate? The learning rate is one of the most important hyperparameters in the training process: it controls the size of the steps the model takes towards the minimum of the loss.
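To make this concrete, here is a minimal toy example (separate from the CNN below) showing how the learning rate scales each gradient descent step; the quadratic loss and its gradient are made up purely for illustration.

# Toy loss f(w) = (w - 3)^2, whose minimum is at w = 3
def grad(w):
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1   # size of each step towards the minimum
for step in range(50):
    w = w - learning_rate * grad(w)   # one gradient descent step
print(w)   # converges close to 3.0

A larger learning rate takes bigger steps (and can overshoot), while a smaller one converges more slowly.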
Hands-on implementation
Here we will implement a Convolutional Neural Network (CNN) model for MNIST digit classification, which we will use to compare the optimization techniques.
from tensorflow.keras.datasets import mnist
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

# Model configuration
batch_size = 250
no_epochs = 5
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load the MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Shape of the input sets
input_train_shape = input_train.shape
input_test_shape = input_test.shape

# Keras layer input shape
input_shape = (input_train_shape[1], input_train_shape[2], 1)

# Reshape the data to include the channel dimension
input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1)
input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize input data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(BatchNormalization())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(no_classes, activation='softmax'))
Now, we will discuss the adaptive optimization techniques one by one and plug each of them into the above-defined CNN model.
Adagrad
Adagrad adapts the learning rate for each parameter by dividing the base learning rate by the square root of the cumulative sum of squared gradients observed so far.
The update rule is:

θ_{t+1} = θ_t − (η / √(G_t + ε)) · g_t

Here θ is the parameter we need to update, η is the learning rate, ε is a small constant added to avoid division by zero, g_t is the gradient at time t, and G_t is the accumulated sum of squared gradients up to time t.
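As a rough NumPy sketch of the rule above (for intuition only, not the Keras implementation), a per-parameter Adagrad step could look like this; the function and variable names are illustrative.

import numpy as np

def adagrad_update(theta, g, G, lr=0.001, eps=1e-7):
    # Accumulate the sum of squared gradients over all past steps
    G = G + g ** 2
    # Each parameter gets its own effective learning rate lr / sqrt(G + eps)
    theta = theta - lr * g / np.sqrt(G + eps)
    return theta, G

# Example usage on a 3-parameter vector
theta, G = np.zeros(3), np.zeros(3)
theta, G = adagrad_update(theta, np.array([0.1, -0.2, 0.3]), G)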
Compiling the CNN model with Adagrad Optimizer
# Compile the model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adagrad(learning_rate=0.001,
                                                            initial_accumulator_value=0.1,
                                                            epsilon=1e-07),
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss using Adagrad: {score[0]} / Test accuracy: {score[1]}')
Adadelta
Adadelta works on exponential moving averages of the squared deltas, where a delta is the difference between the current weight and the newly updated weight, i.e. the parameter update itself. In this way Adadelta removes the need to choose a learning rate and replaces it with a running average of past updates.
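A minimal sketch of this idea, loosely following the original Adadelta formulation (the Keras version additionally scales the step by its learning_rate argument); the function and variable names here are illustrative.

import numpy as np

def adadelta_update(theta, g, Eg2, Edx2, rho=0.95, eps=1e-7):
    # Exponential moving average of squared gradients
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    # Scale the step by the ratio RMS(past updates) / RMS(gradients)
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    # Exponential moving average of squared updates (the "deltas")
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return theta + dx, Eg2, Edx2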
Compiling the CNN model with the Adadelta optimizer, replacing Adagrad in the above CNN model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adadelta(learning_rate=0.001,
                                                             rho=0.95,
                                                             epsilon=1e-07,
                                                             name="Adadelta"),
              metrics=['accuracy'])
After updating the optimizer to Adadelta, we again trained the model.
Adam – Adaptive moment estimation
Adam is very popular, especially among beginners, and is used as the optimizer in many models. It is a combination of RMSprop and momentum: like RMSprop, it uses a moving average of squared gradients to scale the learning rate, and like momentum, it uses a moving average of the gradients themselves. It computes an individual adaptive learning rate for each parameter.
Momentum
In the momentum technique, instead of using only the gradient of the current step, we also accumulate the gradients of past steps to move towards the global minimum. Momentum is commonly combined with SGD, where it helps accelerate training and smooths out the updates.
A typical value for the momentum term is γ = 0.9.
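A minimal sketch of one common formulation of SGD with momentum (the function name and signature are illustrative):

import numpy as np

def momentum_update(theta, g, v, lr=0.01, gamma=0.9):
    # Velocity accumulates an exponentially decaying sum of past gradients
    v = gamma * v + g
    # Move the parameters along the accumulated direction, scaled by lr
    return theta - lr * v, v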
The moving averages in Adam are updated as:

m_t = β₁ · m_{t−1} + (1 − β₁) · g_t
v_t = β₂ · v_{t−1} + (1 − β₂) · g_t²

Here m and v are moving averages of the gradient and of the squared gradient, g is the gradient computed on the current mini-batch, and the decay rates (used only in Adam) default to β₁ = 0.9 and β₂ = 0.999.
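Putting the two moving averages together, a rough sketch of a single Adam step (including the bias correction from the original paper) might look like this; the names are illustrative, not the Keras internals.

import numpy as np

def adam_update(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # First moment: moving average of the gradients (momentum-like term)
    m = beta1 * m + (1 - beta1) * g
    # Second moment: moving average of the squared gradients (RMSprop-like term)
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction compensates for the zero initialization of m and v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v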
Compiling the CNN model with the Adam optimizer, replacing the optimizer in the above CNN model
model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['accuracy'])
After updating the optimizer to Adam, we again trained the model.
Adabound
Adabound is an Adam variant that imposes dynamic bounds on the learning rates; it aims to be as fast as Adam and as good as SGD. A known problem with adaptive techniques is that they can fail to converge well because of unstable and extreme learning rates. Adabound addresses this by clipping the per-parameter learning rate between a lower and an upper bound, which are initialized as 0 and infinity and gradually tighten during training.
This concept was inspired by gradient clipping, where gradients larger than a threshold are clipped to avoid gradient explosion.
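As a sketch of the bounding idea (loosely following the Adabound paper; the bound schedule and the gamma parameter here are illustrative, not the library internals), the Adam-style step size is clipped between bounds that both converge towards a final SGD-like learning rate:

import numpy as np

def bounded_step_size(v_hat, t, lr=1e-3, final_lr=0.1, gamma=1e-3, eps=1e-8):
    # Raw Adam-style step size for each parameter
    step = lr / (np.sqrt(v_hat) + eps)
    # Bounds start near 0 and infinity and tighten towards final_lr as t grows
    lower = final_lr * (1 - 1 / (gamma * t + 1))
    upper = final_lr * (1 + 1 / (gamma * t))
    # Clip the step size into [lower, upper], analogous to gradient clipping
    return np.clip(step, lower, upper)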
Compiling the CNN model with the Adabound optimizer, replacing the optimizer in the above CNN model
In the code snippet below we import AdaBound, because it is not built into the Keras optimizers library and has to be installed separately (e.g. via the keras-adabound package).
from keras_adabound import AdaBound

model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
              optimizer=AdaBound(lr=1e-3, final_lr=0.1),
              metrics=['accuracy'])
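Training and evaluation then follow the same pattern as in the Adagrad snippet above:

history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=validation_split)
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss using AdaBound: {score[0]} / Test accuracy: {score[1]}')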
After updating the optimizer to Adabound, we again trained the model.
Conclusion
In this article, we discussed adaptive optimization techniques and demonstrated their implementation. As discussed above, a good optimization algorithm should combine fast convergence with good generalization to new data. In our experiments, the Adabound optimizer achieved the highest accuracy among the optimizers compared, balancing convergence and generalization.