Depending on the number of training examples used for each update of the model parameters, there are three types of gradient descent:
- Batch Gradient Descent: the parameters are updated after the error gradient is computed across the entire training set
- Stochastic Gradient Descent: the parameters are updated after the error gradient is computed with respect to a single training example
- Mini-Batch Gradient Descent: the parameters are updated after the error gradient is computed with respect to a subset of the training set
| Batch Gradient Descent | Stochastic Gradient Descent | Mini-Batch Gradient Descent |
| --- | --- | --- |
| Since the entire training set is considered before taking a step in the direction of the gradient, a single update takes a long time. | Since only a single training example is considered before taking a step in the direction of the gradient, we are forced to loop over the training set and thus cannot exploit the speed associated with vectorizing the code. | Since a subset of training examples is considered, it can make quick updates to the model parameters and can also exploit the speed associated with vectorizing the code. |
| It makes smooth updates to the model parameters. | It makes very noisy updates to the parameters. | Depending upon the batch size, the updates can be made less noisy: the greater the batch size, the less noisy the update. |
Mini-batch gradient descent therefore strikes a trade-off between fast convergence and the noise associated with each gradient update, making it a more flexible and robust algorithm.
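To make the distinction concrete, here is a minimal runnable sketch showing how much data each variant consumes per parameter update. The toy data, the `grad()` helper, and the hyperparameter values are illustrative assumptions, not part of the implementation developed later in this article:

```python
import numpy as np

# hypothetical helper: error gradient of a linear model with squared loss
def grad(X, y, theta):
    return X.T @ (X @ theta - y) / len(y)

# toy data: 100 examples, 2 features, known linear relationship plus noise
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = X_train @ np.array([[2.0], [-1.0]]) + 0.1 * rng.normal(size=(100, 1))

theta = np.zeros((2, 1))
lr, batch_size, i = 0.1, 32, 0

# batch gradient descent: one update uses the entire training set
theta = theta - lr * grad(X_train, y_train, theta)

# stochastic gradient descent: one update uses a single example
theta = theta - lr * grad(X_train[i:i + 1], y_train[i:i + 1], theta)

# mini-batch gradient descent: one update uses a subset of batch_size examples
theta = theta - lr * grad(X_train[i:i + batch_size], y_train[i:i + batch_size], theta)
```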
Mini-Batch Gradient Descent:

Let theta = model parameters and max_iters = number of epochs.

for itr = 1, 2, 3, ..., max_iters:
    for each mini-batch (X_mini, y_mini):
        - Forward pass on the batch X_mini:
            - Make predictions on the mini-batch
            - Compute the error in the predictions (J(theta)) with the current values of the parameters
        - Backward pass:
            - Compute gradient(theta) = partial derivative of J(theta) w.r.t. theta
        - Update parameters:
            - theta = theta - learning_rate * gradient(theta)
Below is the Python implementation:
Step #1: First, we import dependencies, generate the data for linear regression, and visualize it. We create 8000 examples, each with 2 attributes/features. These examples are split into a training set (X_train, y_train) and a test set (X_test, y_test) of 7200 and 800 examples, respectively.
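A minimal sketch of this step. The Gaussian mean and covariance used to generate the data, and plotting only the first 500 points, are illustrative assumptions; the 90/10 split matches the example counts shown below:

```python
import numpy as np
import matplotlib.pyplot as plt

# creating data: 8000 samples, 2 correlated attributes per sample
mean = np.array([5.0, 6.0])
cov = np.array([[1.0, 0.95],
                [0.95, 1.2]])
data = np.random.multivariate_normal(mean, cov, 8000)

# visualizing the generated data
plt.scatter(data[:500, 0], data[:500, 1], marker='.')
plt.show()

# prepend a column of ones so theta[0] acts as the bias term
data = np.hstack((np.ones((data.shape[0], 1)), data))

# 90/10 train-test split
split_factor = 0.90
split = int(split_factor * data.shape[0])

X_train = data[:split, :-1]
y_train = data[:split, -1].reshape((-1, 1))
X_test = data[split:, :-1]
y_test = data[split:, -1].reshape((-1, 1))

print("Number of examples in training set =", X_train.shape[0])
print("Number of examples in test set =", X_test.shape[0])
```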
Number of examples in training set = 7200
Number of examples in test set = 800
Step #2: Next, we write the code to implement linear regression using mini-batch gradient descent. gradientDescent() is the main driver function; the others are helper functions used for making predictions (hypothesis()), computing gradients (gradient()), computing the error (cost()), and creating mini-batches (create_mini_batches()). The driver function initializes the parameters, computes the best set of parameters for the model, and returns them along with a list containing the history of the error as the parameters were updated.
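A sketch of these functions, consistent with the names described above; the squared-error cost, the learning rate of 0.001, the batch size of 32, and the 3 training epochs are illustrative assumptions:

```python
import numpy as np

# predictions of a linear model; column 0 of X is all ones (bias term)
def hypothesis(X, theta):
    return np.dot(X, theta)

# gradient of the squared-error cost w.r.t. theta over the given examples
def gradient(X, y, theta):
    h = hypothesis(X, theta)
    return np.dot(X.transpose(), (h - y))

# squared-error cost of the current parameters over the given examples
def cost(X, y, theta):
    h = hypothesis(X, theta)
    J = np.dot((h - y).transpose(), (h - y)) / 2
    return J[0]

# shuffle the data and carve it into mini-batches of size batch_size
def create_mini_batches(X, y, batch_size):
    mini_batches = []
    data = np.hstack((X, y))
    np.random.shuffle(data)
    n_minibatches = data.shape[0] // batch_size

    for i in range(n_minibatches + 1):
        mini_batch = data[i * batch_size:(i + 1) * batch_size, :]
        if mini_batch.shape[0] == 0:  # no remainder batch when batch_size divides evenly
            continue
        X_mini = mini_batch[:, :-1]
        y_mini = mini_batch[:, -1].reshape((-1, 1))
        mini_batches.append((X_mini, y_mini))
    return mini_batches

# driver: initialize theta, run mini-batch gradient descent, and
# return the parameters together with the history of the cost
def gradientDescent(X, y, learning_rate=0.001, batch_size=32, max_iters=3):
    theta = np.zeros((X.shape[1], 1))
    error_list = []
    for itr in range(max_iters):
        mini_batches = create_mini_batches(X, y, batch_size)
        for X_mini, y_mini in mini_batches:
            theta = theta - learning_rate * gradient(X_mini, y_mini, theta)
            error_list.append(cost(X_mini, y_mini, theta))
    return theta, error_list
```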
We then call gradientDescent() to compute the model parameters (theta) and visualize the change in the error function:

    theta, error_list = gradientDescent(X_train, y_train)
    print("Bias = ", theta[0])
    print("Coefficients = ", theta[1:])

    # visualize gradient descent
    plt.plot(error_list)
    plt.xlabel("Number of iterations")
    plt.show()

Bias = [0.81830471]
Coefficients = [[1.04586595]]
Step #3: Finally, we make predictions on the test set and calculate the mean absolute error of the predictions.
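A sketch of this step, reusing hypothesis(), X_test, y_test, and the theta returned by gradientDescent() in the earlier sketches:

```python
# predicting output for X_test
y_pred = hypothesis(X_test, theta)

# plot the test data and the fitted line
plt.scatter(X_test[:, 1], y_test, marker='.')
plt.plot(X_test[:, 1], y_pred, color='orange')
plt.show()

# calculating the mean absolute error
error = np.sum(np.abs(y_test - y_pred)) / y_test.shape[0]
print("Mean absolute error =", error)
```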
Mean absolute error = 0.4366644295854125
The orange line represents the final hypothesis function: y_pred = theta[0] + theta[1] * X_test[:, 1], i.e. the learned linear fit plotted over the test data.