ML | Mini-Batch Gradient Descent with Python



Depending on the number of training examples used for each update of the model parameters, there are 3 types of gradient descent:

  1. Batch Gradient Descent: parameters are updated after computing the error gradient over the entire training set
  2. Stochastic Gradient Descent: parameters are updated after computing the error gradient with respect to a single training example
  3. Mini-Batch Gradient Descent: parameters are updated after computing the error gradient with respect to a subset of the training set

The three variants compare as follows:

Batch Gradient Descent:
  • Since the entire training set is considered before taking a step in the direction of the gradient, a single update takes a long time.
  • It makes smooth updates to the model parameters.

Stochastic Gradient Descent:
  • Since only a single training example is considered before taking a step in the direction of the gradient, we are forced to loop over the training set and cannot exploit the speed of vectorized code.
  • It makes very noisy updates to the parameters.

Mini-Batch Gradient Descent:
  • Since a subset of training examples is considered, it can make quick updates to the model parameters and can also exploit the speed of vectorized code.
  • Depending on the batch size, the updates can be made less noisy: the larger the batch size, the less noisy the update.

So mini-batch gradient descent makes a trade-off between fast convergence and the noise associated with gradient updates, making it a more flexible and robust algorithm.
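To make the contrast concrete, here is a minimal sketch of the three update loops; the grad() helper, the toy X and y arrays, and the step sizes below are illustrative placeholders, not part of the implementation that follows.

import numpy as np

# gradient of the squared-error cost for linear regression (illustrative helper)
def grad(X, y, theta):
    return np.dot(X.T, np.dot(X, theta) - y)

X = np.random.randn(100, 2)                    # toy inputs
y = np.dot(X, np.array([[2.0], [3.0]]))        # toy targets
theta = np.zeros((2, 1))
lr = 0.001

# batch gradient descent: one update per pass over the entire training set
theta = theta - lr * grad(X, y, theta)

# stochastic gradient descent: one update per training example
for i in range(X.shape[0]):
    theta = theta - lr * grad(X[i:i + 1], y[i:i + 1], theta)

# mini-batch gradient descent: one update per subset of batch_size examples
batch_size = 10
for start in range(0, X.shape[0], batch_size):
    theta = theta - lr * grad(X[start:start + batch_size], y[start:start + batch_size], theta)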

Mini-Batch Gradient Descent:

Algorithm:

Let theta = model parameters and max_iters = number of epochs.

for itr = 1, 2, 3, ..., max_iters:
 for each mini-batch (X_mini, y_mini):

  • Forward pass on the batch X_mini:
    • Make predictions on the mini-batch
    • Compute the error in the predictions (J(theta)) with the current values of the parameters
  • Backward pass:
    • Compute gradient(theta) = partial derivative of J(theta) w.r.t. theta
  • Update parameters:
    • theta = theta - learning_rate * gradient(theta)
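For the squared-error cost used in the implementation below, both quantities have a simple closed form (written in the same plain notation as the pseudocode above):

J(theta) = (1/2) * (X_mini * theta - y_mini)^T * (X_mini * theta - y_mini)
gradient(theta) = X_mini^T * (X_mini * theta - y_mini)

Each update then moves theta a small step against this gradient, computed on the current mini-batch only.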

Below is the Python implementation:

Step #1: First, we import dependencies, generate data for linear regression, and visualize the generated data. We create 8,000 data samples, each with 2 attributes/features. These samples are further split into a training set (X_train, y_train) and a test set (X_test, y_test) with 7,200 and 800 examples, respectively.

# importing dependencies
import numpy as np
import matplotlib.pyplot as plt

# creating the data
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95], [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)

# visualising the data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.show()

# train-test split
data = np.hstack((np.ones((data.shape[0], 1)), data))

split_factor = 0.90
split = int(split_factor * data.shape[0])

X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))

print("Number of examples in training set = %d" % (X_train.shape[0]))
print("Number of examples in testing set = %d" % (X_test.shape[0]))

Output:

Number of examples in training set = 7200
Number of examples in testing set = 800

Step #2: Next, we write the code to implement linear regression using mini-batch gradient descent.
gradientDescent() is the main driver function, and the others are helper functions used for making predictions — hypothesis(), computing gradients — gradient(), computing the error — cost(), and creating mini-batches — create_mini_batches(). The driver function initializes the parameters, computes the best set of parameters for the model, and returns those parameters along with a list containing the history of errors as the parameters were updated.

# linear regression using "mini-batch" gradient descent

# function to compute the hypothesis / predictions
def hypothesis(X, theta):
    return np.dot(X, theta)

# function to compute the gradient of the error w.r.t. theta
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    grad = np.dot(X.transpose(), (h - y))
    return grad

# function to compute the error for the current values of theta
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).transpose(), (h - y))
    J /= 2
    return J[0]

# function to create a list containing mini-batches
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    for i in range(n_minibatches):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        X_mini = mini_batch[:, :-1]
        y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, y_mini))
    # remaining examples that do not fill a complete batch
    if data.shape[0] % batch_size != 0:
        mini_batch = data[n_minibatches * batch_size:, :]
        X_mini = mini_batch[:, :-1]
        y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, y_mini))
    return mini_batches

# function to perform mini-batch gradient descent
def gradientDescent(X, y, learning_rate=0.001, batch_size=32):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    max_iters = 3
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for mini_batch in mini_batches:
            X_mini, y_mini = mini_batch
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))

    return theta, error_list
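Before running the full driver, the create_mini_batches() helper can be checked on its own; this quick check is not part of the original walkthrough, and the numbers in the comments assume the 7,200-example training split from step 1:

# illustrative check of the mini-batch helper
batches = create_mini_batches(X_train, y_train, batch_size=32)
print("Number of mini-batches =", len(batches))         # 7200 / 32 = 225
print("Shape of first X_mini =", batches[0][0].shape)   # (32, 2): bias column + one feature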

We call the gradientDescent() function to compute the model parameters (theta) and visualize the change in the error function:

theta, error_list = gradientDescent(X_train, y_train)
print("Bias =", theta[0])
print("Coefficients =", theta[1:])

# visualising gradient descent
plt.plot(error_list)
plt.xlabel("Number of iterations")
plt.ylabel("Cost")
plt.show()

Output:
Bias = [0.81830471]
Coefficients = [[1.04586595]]
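As an optional sanity check, not part of the original walkthrough, the learned parameters can be compared against the closed-form least-squares solution. With only 3 epochs the two will agree only roughly, and more closely as max_iters is increased:

# closed-form least-squares fit for comparison (optional check, uses X_train / y_train from step 1)
theta_exact = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
print("Closed-form parameters =", theta_exact.ravel())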

Step #3: Finally, we make predictions on the test set and compute the mean absolute error of the predictions.

# predicting output for X_test
y_pred = hypothesis(X_test, theta)
plt.scatter(X_test[:, 1], y_test[:, ], marker='.')
plt.plot(X_test[:, 1], y_pred, color='orange')
plt.show()

# calculating error in predictions
error = np.sum(np.abs(y_test - y_pred) / y_test.shape[0])
print("Mean absolute error =", error)

Output:

Mean absolute error = 0.4366644295854125
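If scikit-learn is available in your environment, the same metric can be cross-checked with its built-in helper:

# optional cross-check of the mean absolute error with scikit-learn
from sklearn.metrics import mean_absolute_error
print("Mean absolute error (sklearn) =", mean_absolute_error(y_test, y_pred))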

The orange line represents the final hypothesis function learned by the model: y_pred = theta[0] + theta[1] * X_test[:, 1].