Different Variants of Gradient Descent

Induraj
4 min read · Feb 21, 2023

There are several variants of gradient descent that are commonly used in machine learning and optimization. Here are some of the most popular ones:

1. Batch Gradient Descent:

Batch gradient descent, also known as vanilla gradient descent, is the simplest variant. In each iteration, the entire training dataset is used to compute the gradient of the cost function with respect to the model parameters. This is computationally expensive for large datasets, but the updates are smooth and deterministic, and with a suitably chosen learning rate the algorithm converges steadily to a minimum of the cost function (the global minimum when the cost is convex, a local minimum in general).

θ = θ − α · ∇J(θ)
where:
θ is the parameter vector
α is the learning rate
∇J(θ) is the gradient of the cost function J with respect to θ
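To make the update concrete, here is a minimal NumPy sketch of batch gradient descent for a linear model with a mean-squared-error cost; the cost function, the feature matrix X, and the targets y are assumptions chosen for the example, not something fixed by the formula above.

```python
import numpy as np

# Batch gradient descent for linear regression with the illustrative cost
# J(theta) = (1 / (2m)) * ||X @ theta - y||^2.
def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # gradient of J over the FULL dataset
        theta = theta - alpha * grad      # theta = theta - alpha * grad J(theta)
    return theta
```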

2. Stochastic Gradient Descent (SGD):

Stochastic gradient descent is a variant of gradient descent that updates the model parameters using one training example at a time. Unlike batch gradient descent, which uses the entire dataset to compute the gradient, SGD updates the parameters based on a single randomly selected training example. Because updates are far more frequent, SGD can make faster progress on large datasets, but the single-example gradients are noisy, so the cost function oscillates rather than decreasing smoothly.

θ = θ − α · ∇Ji(θ)
where:
θ is the parameter vector
α is the learning rate
∇Ji(θ) is the gradient of the cost function J with respect to θ, computed on a single randomly selected training example i
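As a rough sketch of the same idea, the loop below performs one update per randomly chosen example each epoch; the squared-error gradient and the names X and y are again assumptions made for illustration.

```python
import numpy as np

# Stochastic gradient descent: each parameter update uses ONE example.
def sgd(X, y, alpha=0.01, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):               # visit examples in random order
            grad_i = X[i] * (X[i] @ theta - y[i])  # gradient of J_i(theta) for example i
            theta = theta - alpha * grad_i
    return theta
```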

3. Mini-batch Gradient Descent:

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent. In mini-batch gradient descent, the gradients are computed on a small random subset of the training dataset, typically between 10 and 1000 examples, called a mini-batch. This reduces the computational cost of the algorithm compared to batch gradient descent, while also reducing the variance of the updates compared to SGD. Mini-batch gradient descent is widely used in deep learning because it strikes a good balance between convergence speed and stability.
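A minimal sketch of the mini-batch loop, under the same assumed squared-error cost as above, is shown below; the batch size of 32 is just a common default, not a prescribed value.

```python
import numpy as np

# Mini-batch gradient descent: each update averages the gradient over a
# small random subset of the data (here 32 examples).
def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=32, n_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta = theta - alpha * grad
    return theta
```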

4. Momentum Gradient Descent:

Momentum gradient descent is a variant of gradient descent that adds a momentum term to the update rule. The momentum term is an exponentially decaying accumulation of past gradients: directions the gradients agree on build up speed, while directions that keep flipping sign cancel out, which dampens oscillations and speeds up convergence. This is particularly useful when the cost function is noisy or has steep, narrow valleys, where plain gradient descent oscillates or makes very slow progress.
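One common form of the momentum update is sketched below, with an assumed grad_fn callable that returns ∇J(θ) for the problem at hand; β is the momentum coefficient, often set around 0.9.

```python
import numpy as np

# Momentum: keep a running "velocity" that accumulates past gradients, so
# directions the gradients agree on speed up and oscillating ones cancel.
def momentum_gd(grad_fn, theta0, alpha=0.01, beta=0.9, n_iters=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad_fn(theta)
        v = beta * v + g           # exponentially decaying sum of gradients
        theta = theta - alpha * v  # step along the accumulated direction
    return theta
```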

5. Nesterov Accelerated Gradient (NAG):

Nesterov accelerated gradient is an extension of momentum gradient descent that evaluates the gradient at the look-ahead position the momentum step is about to reach, rather than at the current parameters. This gives the update an early correction when the momentum is carrying it too far, which reduces overshooting and can lead to faster convergence than plain momentum.
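A sketch of the Nesterov variant, using the same assumed grad_fn convention as the momentum example; the only change is where the gradient is evaluated.

```python
import numpy as np

# Nesterov accelerated gradient: evaluate the gradient at the "look-ahead"
# point the momentum step is about to reach, not at the current parameters.
def nesterov_gd(grad_fn, theta0, alpha=0.01, beta=0.9, n_iters=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(n_iters):
        g = grad_fn(theta - alpha * beta * v)  # gradient at the look-ahead position
        v = beta * v + g
        theta = theta - alpha * v
    return theta
```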

6. Adagrad:

Adagrad is a variant of gradient descent that adapts the learning rate for each parameter individually, dividing it by the square root of the sum of that parameter's squared historical gradients. Parameters that have received large or frequent gradients get smaller effective learning rates, while rarely updated parameters keep larger ones. This normalizes the updates and is useful when different parameters have very different gradient scales. Its main drawback is that the accumulated sum only grows, so the effective learning rate keeps shrinking over time.
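A minimal sketch of the Adagrad rule, again with an assumed grad_fn callable; eps is a small constant that prevents division by zero.

```python
import numpy as np

# Adagrad: divide the learning rate by the square root of the running SUM of
# squared gradients, so heavily updated parameters take smaller steps.
def adagrad(grad_fn, theta0, alpha=0.1, eps=1e-8, n_iters=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    G = np.zeros_like(theta)  # per-parameter sum of squared gradients
    for _ in range(n_iters):
        g = grad_fn(theta)
        G = G + g ** 2
        theta = theta - alpha * g / (np.sqrt(G) + eps)
    return theta
```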

7. RMSProp:

RMSProp is a variant of gradient descent that also adapts the learning rate for each parameter, but instead of accumulating all squared gradients as Adagrad does, it uses an exponentially decaying moving average of the squared gradients. This keeps the effective learning rate from shrinking to zero and reduces the step size for parameters whose gradients are consistently large, which would otherwise cause the algorithm to oscillate or diverge.
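The sketch below shows the RMSProp rule under the same assumptions as the previous examples; rho is the decay rate of the moving average, commonly around 0.9.

```python
import numpy as np

# RMSProp: like Adagrad, but use an exponential MOVING AVERAGE of squared
# gradients instead of their full sum, so the step size can recover.
def rmsprop(grad_fn, theta0, alpha=0.001, rho=0.9, eps=1e-8, n_iters=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    s = np.zeros_like(theta)  # moving average of squared gradients
    for _ in range(n_iters):
        g = grad_fn(theta)
        s = rho * s + (1 - rho) * g ** 2
        theta = theta - alpha * g / (np.sqrt(s) + eps)
    return theta
```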

8. Adam:

Adam, which stands for Adaptive Moment Estimation, is a variant of gradient descent that combines the ideas of momentum and RMSProp. It keeps an exponentially decaying moving average of the gradients (the first moment, which plays the role of momentum) and of the squared gradients (the second moment, which scales the per-parameter learning rate), and applies a bias correction to both. Adam is one of the most widely used optimization algorithms in deep learning because it is efficient, stable, and robust across different types of cost functions and datasets.
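The Adam update is sketched below with the same assumed grad_fn convention; beta1, beta2, and eps use the commonly quoted defaults of 0.9, 0.999, and 1e-8.

```python
import numpy as np

# Adam: a moving average of gradients (first moment, i.e. momentum) plus a
# moving average of squared gradients (second moment, as in RMSProp),
# both corrected for their bias toward zero at early steps.
def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=1000):
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first moment estimate
    v = np.zeros_like(theta)  # second moment estimate
    for t in range(1, n_iters + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```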

These are just a few of the most popular variants of gradient descent. Each variant has its own strengths and weaknesses, and the choice of which variant to use depends on the specific problem and dataset at hand.

Pros and cons of each variant

Other related Articles:

Implementing gradient descent in Python (click here)

Implementation of particle swarm optimization (click here)

Implementing Gradient Descent In Quality Control to minimize “Defect rate” — Python (click here)
