This article covers the basics of Softmax regression and how it is implemented in Python using the TensorFlow library.
What is Softmax Regression?
Softmax Regression (or Multinomial Logistic Regression) is a generalization of logistic regression to the case where we want to handle multiple classes.
A short introduction to logistic regression can be found here:
Understanding Logistic Regression
In binary logistic regression, we assumed that the labels were binary, i.e. for the i-th observation, y_i ∈ {0, 1}.
But consider a scenario in which we need to classify an observation into one of two or more class labels, for example, digit classification. The possible labels here are y_i ∈ {0, 1, 2, ..., 9}.
In such cases we can use Softmax Regression .
Let us first define our model:

Let the dataset have m features and n observations, and let there be k class labels, so that every observation can be classified as one of the k possible target values.

Feature matrix: the feature matrix is denoted X. Here, x_ij denotes the value of the j-th feature for the i-th observation. The matrix has dimensions n x (m + 1).

Weight matrix: the weight matrix is denoted W. Here, w_ij represents the weight assigned to the i-th feature for the j-th class label. The matrix has dimensions (m + 1) x k. Initially, the weight matrix is filled using some normal distribution.

Logit score matrix: the logit score matrix Z = XW has dimensions n x k.

Currently, we are taking an extra column in the feature matrix, x_0 = 1, and an extra row in the weight matrix, w_0. These extra columns and rows correspond to the bias terms associated with each prediction. This could be simplified by defining an extra matrix for the bias, b, of size n x k, where b_ij = w_0j. (In practice, all we need is a vector of size k and some broadcasting for the bias terms!) So, the final score matrix is:

Z = XW + b

where the X matrix now has dimensions n x m and W has dimensions m x k, but Z still has the same values and dimensions, n x k!

But what does the Z matrix mean? In fact, z_ij is the likelihood of the j-th class label for the i-th observation. It is not a valid probability value, but it can be viewed as a score assigned to each class label for each observation!
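As a concrete sketch of these shapes (the sizes below are illustrative, not from the article), here is how the score matrix can be computed with NumPy, using broadcasting for the bias vector instead of an explicit n x k bias matrix:

```python
import numpy as np

n, m, k = 5, 784, 10          # observations, features, class labels (illustrative)

X = np.random.randn(n, m)     # feature matrix, n x m
W = np.random.randn(m, k)     # weight matrix, m x k, normally initialized
b = np.zeros(k)               # bias vector of size k

# Broadcasting adds b to every row, so no explicit n x k bias matrix is needed.
Z = X @ W + b                 # logit score matrix, n x k

print(Z.shape)                # (5, 10)
```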
Let us denote by Z_i the logit score vector for the i-th observation.
For example, let the vector Z_i hold the scores for each of the 10 class labels in the digit classification task for the i-th observation. Here the maximum score is 5.2, which corresponds to the class label "3". Therefore, our model currently predicts the i-th observation/image as "3".
For the logit vector Z_i, the softmax function is defined as:

softmax(Z_i)_j = exp(z_ij) / (sum over c = 1..k of exp(z_ic))
So the softmax function does two things:
1. It converts all scores into probabilities.
2. It makes the probabilities sum to 1.
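Both properties can be checked with a small NumPy sketch (the score values here are illustrative, not from the article):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.1, 2.0, 0.5, 5.2])   # illustrative logit scores
p = softmax(z)

print(p)            # every entry is a valid probability in (0, 1)
print(p.sum())      # 1.0
print(p.argmax())   # 3 -- the index of the largest score (5.2)
```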
Recall that in the binary logistic classifier we used the sigmoid function for the same task. The softmax function is nothing more than a generalization of the sigmoid function! This softmax function computes the probability that the i-th training sample belongs to class j, given the logits vector Z_i, as:

P(y_i = j | Z_i) = softmax(Z_i)_j
In vector form, we can simply write:

P(y_i | Z_i) = softmax(Z_i)

For simplicity, let S_i denote the softmax probability vector for the i-th observation.
Now, we define the one-hot encoded target vector T_i for the i-th observation as:

T_ij = 1 if y_i = j, and T_ij = 0 otherwise.
And now the cost function J can be defined as the average cross-entropy:

J(W, b) = -(1/n) * (sum over i = 1..n, j = 1..k of T_ij * log(S_ij))

and the task is to minimize this cost function!
Using gradient descent, we compute the gradients of the cost with respect to the weights and biases:

dJ/dW = (1/n) * X^T (S - T)   and   dJ/db = (1/n) * (sum over i = 1..n of (S_i - T_i))

which we then use to update the weights and biases in the direction opposite to the gradient:

W := W - alpha * dJ/dW   and   b := b - alpha * dJ/db

where alpha is the learning rate. Using this cost gradient, we iteratively update the weight matrix until we reach a given number of epochs (passes through the training set) or the desired cost threshold.
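These update rules can be sketched end-to-end in NumPy on a tiny synthetic problem (all sizes, data, and names below are illustrative, not from the article, which uses TensorFlow instead):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 4, 3                        # observations, features, classes
X = rng.normal(size=(n, m))
y = X[:, :k].argmax(axis=1)                # synthetic labels derived from the features
T = np.eye(k)[y]                           # one-hot target matrix, n x k

W = rng.normal(scale=0.01, size=(m, k))    # weight matrix
b = np.zeros(k)                            # bias vector
alpha = 0.5                                # learning rate

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(100):
    S = softmax(X @ W + b)                           # probability matrix S, n x k
    cost = -np.mean(np.sum(T * np.log(S), axis=1))   # average cross-entropy
    dW = X.T @ (S - T) / n                           # dJ/dW
    db = (S - T).mean(axis=0)                        # dJ/db
    W -= alpha * dW                                  # step opposite the gradient
    b -= alpha * db

print(cost)  # well below the initial cost of about log(3) ~= 1.0986
```

The gradient lines are a direct transcription of the formulas above; note that each row of S sums to 1, so the cost is always well defined.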
Implementation
Now, let's implement Softmax Regression on the MNIST handwritten digit dataset using the TensorFlow library.
For a detailed look at TensorFlow, follow this tutorial:
Step 1: Import the dependencies

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
Step 2: Download data
TensorFlow allows you to automatically download and read MNIST data. Consider the code below. It will download and save the data to the MNIST_data folder in the current project directory and load it into the current program.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Step 3: Understanding the data
Now we will try to understand the structure of the dataset.
The MNIST data is divided into three parts: 55,000 training data points (mnist.train), 10,000 test data points (mnist.test), and 5,000 validation data points (mnist.validation).
Each image is 28 by 28 pixels, flattened into a one-dimensional array of size 784. The number of class labels is 10. Each target label is already provided in one-hot encoded form.
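A sketch of the code that would print these properties is shown below. Since the MNIST download is not reproduced here, synthetic arrays with the same shapes stand in for mnist.train.images and mnist.train.labels:

```python
import numpy as np

# Synthetic stand-ins with the same shapes as mnist.train.images and
# mnist.train.labels (the real code would use the loaded dataset instead).
images = np.zeros((55000, 784))
labels = np.zeros((55000, 10))
labels[0, 7] = 1.0  # per the printed output, the 1st observation is the digit "7"

print("Shape of feature matrix:", images.shape)
print("Shape of target matrix:", labels.shape)
print("One-hot encoding for 1st observation:", labels[0])
```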

Output:
Shape of feature matrix: (55000, 784)
Shape of target matrix: (55000, 10)
One-hot encoding for 1st observation: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
Step 4: Define the computation graph
Now we create the computation graph.
Some important points to pay attention to:
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function. In GD, you must run through ALL the samples in your training set to perform a single parameter update in a given iteration, while in SGD you use ONLY ONE training sample, or a SUBSET of training samples, to perform the update. If you use a subset, it is called Mini-batch Stochastic Gradient Descent. Thus, if the number of training samples is large, in fact very large, using gradient descent may take too long, because in every iteration in which you update the parameter values you run through the complete training set. SGD, on the other hand, will be faster because it uses only one training sample and starts improving itself right away from the first sample.
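The GD / SGD / mini-batch distinction can be illustrated with a small NumPy sketch (the helper name, data, and batch size are illustrative, not the article's TensorFlow code):

```python
import numpy as np

def minibatches(X, T, batch_size, rng):
    # Shuffle once per epoch, then yield consecutive slices.
    # batch_size=1 gives plain SGD; batch_size=len(X) gives full-batch GD.
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], T[sel]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # 100 illustrative observations
T = np.eye(3)[rng.integers(0, 3, size=100)]    # one-hot targets

batches = list(minibatches(X, T, batch_size=32, rng=rng))
print(len(batches))   # 4 -- three batches of 32 samples plus one of 4
```

Each parameter update then uses one such batch instead of the whole training set, which is why an epoch of mini-batch SGD performs many updates where full-batch GD performs one.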
