Implementing Gradient Descent in Python

Induraj
Feb 21, 2023



Gradient descent is a popular optimization algorithm used in machine learning and deep learning to find the optimal parameters or weights for a given model. The goal of gradient descent is to minimize the cost function, which measures the difference between the predicted output of the model and the actual output.

The algorithm works by iteratively adjusting the model’s parameters in the direction of steepest descent of the cost function until a minimum is reached. The gradient is computed by taking the partial derivative of the cost function with respect to each parameter.

There are three main variants of gradient descent:

  1. Batch Gradient Descent: In this variant, the gradient is computed over the entire dataset, and the parameters are updated after each epoch.
  2. Stochastic Gradient Descent: In this variant, the gradient is computed on a single training example, and the parameters are updated after each example.
  3. Mini-Batch Gradient Descent: In this variant, the gradient is computed on a small subset of the training data, and the parameters are updated after each mini-batch, as contrasted in the sketch below.
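
The three variants differ only in how many examples feed each gradient evaluation. Here is a minimal sketch of one training epoch, assuming X and y are NumPy arrays and grad(theta, X, y) is a user-supplied function returning the gradient of the loss on the given rows (all names here are illustrative):

import numpy as np

def epoch_updates(theta, X, y, grad, lr, batch_size=None):
    n = len(X)
    size = n if batch_size is None else batch_size   # None means use the full dataset
    idx = np.random.permutation(n)                   # shuffle once per epoch
    for start in range(0, n, size):
        rows = idx[start:start + size]               # all rows, one row, or a mini-batch
        theta = theta - lr * grad(theta, X[rows], y[rows])
    return theta

# Batch GD:      epoch_updates(theta, X, y, grad, lr)                 -> 1 update per epoch
# SGD:           epoch_updates(theta, X, y, grad, lr, batch_size=1)   -> n updates per epoch
# Mini-batch GD: epoch_updates(theta, X, y, grad, lr, batch_size=32)  -> n/32 updates per epoch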

Gradient descent is used in various machine learning applications, such as linear regression, logistic regression, and neural networks, to optimize the model’s parameters and improve its accuracy. It is a fundamental algorithm in machine learning and is essential for training complex models with large amounts of data.

Formula:

θ := θ − α · ∇J(θ)

where θ is the vector of model parameters, α is the learning rate, and ∇J(θ) is the gradient of the cost function J with respect to θ.

During each iteration of gradient descent, the parameters θ are updated according to the above formula, with ∇J(θ) evaluated at the current values of θ. Each iteration thus takes a step in the direction of steepest descent of the cost function, and the size of that step is controlled by the learning rate α, which must be chosen carefully for the algorithm to converge to the optimal solution.
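
In code, a single update step is one line. A minimal sketch in NumPy, using an illustrative toy cost J(θ) = ‖θ‖² whose gradient is 2θ (grad_J and alpha are stand-in names for this sketch):

import numpy as np

# Illustrative gradient of J(theta) = ||theta||^2, used only for this sketch
def grad_J(theta):
    return 2 * theta

theta = np.array([4.0, -2.0])   # initial parameters
alpha = 0.1                     # learning rate

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # step opposite the gradient

print(theta)   # approaches the minimizer [0, 0]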

Practical use cases of Gradient Descent:

Gradient descent is a fundamental optimization algorithm in machine learning and has numerous practical use cases. Here are some examples:

  1. Linear Regression: In linear regression, gradient descent is used to find the coefficients that minimize the sum of squared errors between the predicted and actual values (see the sketch after this list).
  2. Logistic Regression: In logistic regression, gradient descent is used to find the parameters that minimize the cross-entropy loss, which measures the difference between the predicted probabilities and the actual labels.
  3. Neural Networks: In deep learning, gradient descent (driven by backpropagation) optimizes the weights and biases of the network by minimizing the loss between the predicted and actual outputs.
  4. Support Vector Machines (SVMs): Linear SVMs can be trained with (sub)gradient descent on the hinge loss to find a large-margin separating hyperplane, although specialized quadratic-programming solvers are also widely used.
  5. Dimensionality Reduction: Principal Component Analysis (PCA) is usually computed with an eigendecomposition or SVD, but gradient-based methods can also recover the leading eigenvectors that capture the most variance in the data.
  6. Clustering: The centroid update in k-means can be interpreted as a gradient step that reduces the sum of squared distances between data points and their assigned centroids, even though the classic algorithm alternates assignment and averaging steps.
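
As an illustration of the first use case, here is a minimal linear regression fitted by batch gradient descent on synthetic data (the data, learning rate, and iteration count are arbitrary choices for this sketch):

import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3 * x + 2 + 0.1 * rng.standard_normal(100)

w, b = 0.0, 0.0   # initial coefficients
lr = 0.1          # learning rate

for _ in range(500):
    y_pred = w * x + b
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2 * np.mean((y_pred - y) * x)
    grad_b = 2 * np.mean(y_pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # close to the true values 3 and 2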

Overall, gradient descent is a versatile optimization algorithm that is widely used in various machine learning applications to find the optimal parameters of the model and improve its accuracy.

Implementation Steps:

  1. Choose a model and a cost function:
  • Select a model that you want to optimize, such as linear regression, logistic regression, or a neural network.
  • Choose a cost function that measures the difference between the predicted output and the actual output, such as mean squared error for regression or cross-entropy (log) loss for classification.

2. Initialize the parameters:

  • Set the initial values for the parameters that you want to optimize, such as the weights and biases of the model.

3. Compute the gradient:

  • Calculate the gradient of the cost function with respect to each parameter by taking the partial derivative of the cost function with respect to each parameter.
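
A common sanity check at this step is to compare the analytic gradient against a finite-difference approximation. A sketch, using an illustrative quadratic cost (J and grad_J here are stand-ins for your own functions):

import numpy as np

def J(theta):
    return np.sum(theta ** 2)   # illustrative cost

def grad_J(theta):
    return 2 * theta            # its analytic gradient

def numerical_grad(J, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (J(theta + e) - J(theta - e)) / (2 * eps)   # central difference
    return g

theta = np.array([1.0, -3.0])
print(np.allclose(grad_J(theta), numerical_grad(J, theta), atol=1e-4))   # True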

4. Update the parameters:

  • Adjust the parameters in the direction of the negative gradient by multiplying it with a learning rate, which controls the size of the update.

5. Repeat until convergence:

  • Repeat steps 3 and 4 until the cost function converges to a minimum or reaches a satisfactory threshold, such as a small change in the cost between iterations.
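
A sketch of such a tolerance-based stopping rule (the function names and tolerances are illustrative, not a fixed recipe):

def minimize(J, grad_J, theta, lr=0.1, tol=1e-8, max_iter=10000):
    cost = J(theta)
    for _ in range(max_iter):
        theta = theta - lr * grad_J(theta)
        new_cost = J(theta)
        if abs(cost - new_cost) < tol:   # stop once the cost barely changes
            break
        cost = new_cost
    return theta

# Usage on the one-dimensional cost J(theta) = theta^2:
theta_opt = minimize(lambda t: t ** 2, lambda t: 2 * t, theta=5.0)
print(theta_opt)   # close to 0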

6. Evaluate the model:

  • Test the trained model on a separate set of data to evaluate its performance, such as the accuracy, precision, recall, or F1 score.
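
For classification models, scikit-learn's metric functions are one common way to score predictions on held-out data. A sketch with made-up labels (in practice y_test comes from your test split and y_pred from the trained model):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels, purely for illustration
y_test = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))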

Note that there are different variants of gradient descent, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, which have different computational and convergence properties. The implementation details may also vary depending on the specific model and library used.

Implementation:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define the function to be minimized (a simple quadratic function)
def f(x, y):
    return x**2 + y**2

# Define the partial derivatives of the function with respect to x and y
def df_dx(x, y):
    return 2 * x

def df_dy(x, y):
    return 2 * y

# Define the gradient descent algorithm
def gradient_descent(start_x, start_y, learning_rate, num_iterations):
    # Initialize the parameters
    x = start_x
    y = start_y
    history = []

    # Perform the gradient descent iterations
    for i in range(num_iterations):
        # Calculate the gradients
        grad_x = df_dx(x, y)
        grad_y = df_dy(x, y)

        # Update the parameters
        x = x - learning_rate * grad_x
        y = y - learning_rate * grad_y

        # Save the history of the parameters
        history.append((x, y, f(x, y)))

    return x, y, f(x, y), history

# Define the meshgrid for plotting the function
x_range = np.arange(-10, 10, 0.1)
y_range = np.arange(-10, 10, 0.1)
X, Y = np.meshgrid(x_range, y_range)
Z = f(X, Y)

# Perform gradient descent and plot the results
start_x, start_y = 8, 8
learning_rate = 0.1
num_iterations = 20
x_opt, y_opt, f_opt, history = gradient_descent(start_x, start_y, learning_rate, num_iterations)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='coolwarm')
ax.scatter(*zip(*history), c='r', marker='o')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x, y)')
plt.show()

This implementation defines a simple quadratic function f(x, y) = x^2 + y^2 and its partial derivatives df_dx(x, y) = 2x and df_dy(x, y) = 2y. It then defines gradient_descent(), which takes the starting point (start_x, start_y), the learning rate learning_rate, and the number of iterations num_iterations as inputs, and returns the optimal point (x_opt, y_opt), the minimum value f_opt, and the history of parameter values recorded during the iterations. Finally, a mesh grid is defined for plotting the function, and the path taken by gradient descent is drawn on a 3D surface using matplotlib.
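
Because gradient_descent() returns the history of iterates, a natural follow-up (run after the script above, which defines history and imports matplotlib) is to plot the cost against the iteration number and check that it decreases:

costs = [h[2] for h in history]   # f(x, y) recorded at each iteration
plt.plot(range(1, len(costs) + 1), costs, marker='o')
plt.xlabel('iteration')
plt.ylabel('f(x, y)')
plt.title('Convergence of gradient descent')
plt.show()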

Pros:

  1. Flexibility: Gradient descent can be used with different types of models and loss functions, making it a versatile optimization algorithm.
  2. Efficiency: Gradient descent is computationally efficient and can handle large datasets with numerous features.
  3. Convergence: For convex, smooth cost functions, gradient descent is guaranteed to converge to a minimum given a sufficiently small learning rate and enough iterations; for non-convex functions it converges to a stationary point rather than necessarily the global minimum.
  4. Scalability: Gradient descent can be parallelized across multiple processors or nodes, enabling faster training times.

Cons:

  1. Sensitivity to Learning Rate: The performance of gradient descent is highly sensitive to the choice of the learning rate, which can be difficult to tune (see the sketch after this list).
  2. Local Minima: Gradient descent can get trapped in local minima, which may not be the global minimum.
  3. Overfitting: Gradient descent can overfit the training data if the regularization is not applied or if the model is too complex.
  4. Scaling: Gradient descent may require feature scaling to ensure that each feature contributes equally to the gradient, which can be a time-consuming preprocessing step.
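
Con 1 is easy to see concretely. For f(x) = x², the update is x ← (1 − 2·lr)·x, so any learning rate above 1.0 makes the multiplier exceed 1 in magnitude and the iterates blow up. A minimal demonstration:

def step(x, lr):
    return x - lr * 2 * x   # gradient of x**2 is 2x

for lr in (0.1, 0.5, 1.1):   # small, exact, and too-large learning rates
    x = 5.0
    for _ in range(10):
        x = step(x, lr)
    print(lr, x)
# lr=0.1 shrinks x toward 0, lr=0.5 jumps straight to 0, lr=1.1 diverges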

Overall, gradient descent is a powerful optimization algorithm with many advantages, but its performance can be impacted by various factors, including the learning rate, the choice of gradient descent variant, and the complexity of the model.

Other related articles —

Different variants of gradient descent (click here)

Implementation of particle swarm optimization (click here)

Implementing Gradient Descent In Quality Control to minimize “Defect rate” — Python (click here)
