This article covers the basics of Softmax regression and how it is implemented in Python using the TensorFlow library.
What is Softmax Regression?
Softmax Regression (or Multinomial Logistic Regression) is a generalization of logistic regression to the case where we want to handle multiple classes.
A short introduction to logistic regression can be found here:
Understanding Logistic Regression
In binary logistic regression, we assumed that the labels were binary, i.e. for the i-th observation, y_i ∈ {0, 1}.
But consider a scenario in which we need to classify an observation into one of two or more class labels, for example, digit classification. The possible labels here are: y ∈ {0, 1, 2, ..., 9}.
In such cases we can use Softmax Regression .
Let’s define our model first:
Let the dataset have "m" features and "n" observations. Also, there are "k" class labels, i.e. each observation can be classified as one of "k" possible target values. For example, if we have a dataset of 100 handwritten digit images of size 28 × 28 for digit classification, then n = 100, m = 28 × 28 = 784 and k = 10.
Feature matrix
The feature matrix, X, holds one observation per row. Here, x_ij denotes the value of the j-th feature for the i-th observation. Including an extra leading column for the bias term (explained below), the matrix X has dimensions n × (m + 1).
Weight matrix
We define the weight matrix, W, as follows: w_ij represents the weight assigned to the i-th feature for the j-th class label. Including an extra row for the bias terms, the matrix W has dimensions (m + 1) × k. Initially, the weight matrix is filled using values drawn from some normal distribution.
Logit score matrix
We then define our net input matrix (also called the logit score matrix), Z, as:

Z = XW

The matrix Z has dimensions n × k.
Note that we took an extra column in the feature matrix X, where x_i0 = 1 for every observation, and an extra row in the weight matrix W, whose entries w_0j act as the bias for class j. So the final score matrix is Z = XW, where X has dimensions n × (m + 1) and W has dimensions (m + 1) × k, giving Z dimensions n × k. But what does the matrix Z mean? The entry Z_ij is the score of the j-th class label for the i-th observation. It is not a valid probability value, but it can be viewed as a score assigned to each class label for each observation!
Let Z_i denote the logit score vector for the i-th observation.
For example, in the digit classification task, let the vector Z_i hold the scores for each of the 10 class labels for the i-th observation, and suppose the maximum score is 5.2, corresponding to the class label "3". Then our model currently predicts the i-th observation/image as "3".
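As a quick illustration (a NumPy sketch with a made-up logit vector standing in for the example above, since the original values are not shown), picking the predicted class from a logit score vector is just an argmax:

```python
import numpy as np

# A made-up logit score vector for one observation in the 10-class
# digit task (these numbers are illustrative, not from the article).
# The highest score, 5.2, sits at index 3, so the model predicts "3".
logits = np.array([1.1, -0.4, 0.3, 5.2, 0.7, -1.2, 2.0, 0.9, 1.5, 0.2])
predicted_label = int(np.argmax(logits))
print(predicted_label)  # 3
```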
 Softmax layer
It is harder to train the model on raw score values, as they are difficult to differentiate when implementing a gradient descent algorithm to minimize the cost function. So we need some function that normalizes the logit scores and also makes them easily differentiable! To convert the score matrix Z into probabilities, we use the softmax function. For a vector y, the softmax function S(y) is defined as:

S(y_j) = e^(y_j) / Σ_(c=1..k) e^(y_c)
So the softmax function does 2 things:
1. Converts all scores to probabilities.
2. Makes the probabilities sum to 1.
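A minimal NumPy sketch of the softmax function (subtracting the maximum score first is a standard numerical-stability trick, our addition rather than something the article requires):

```python
import numpy as np

def softmax(z):
    """Map a vector of logit scores to probabilities that sum to 1.
    Subtracting the max first avoids overflow in exp()."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # each entry in (0, 1), order of the scores preserved
print(probs.sum())  # 1.0
```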
Recall that in the binary logistic classifier we used the sigmoid function for the same task. The softmax function is nothing more than a generalization of the sigmoid function! The softmax function computes the probability that the i-th training sample belongs to class j, given the logit vector Z_i, as:

P(y = j | Z_i) = e^(Z_ij) / Σ_(c=1..k) e^(Z_ic)
In vector form, we can simply write:

S_i = softmax(Z_i)

For simplicity, let S_i denote the softmax probability vector for the i-th observation.
One-hot encoded target matrix
Since the softmax function gives us a vector of probabilities of each class label for a given observation, we need to convert the target vector into the same format in order to compute the cost function! For each observation, there is a target vector (instead of a single target value!) consisting only of zeros and ones, where only the position of the correct label is set to 1. This technique is called one-hot encoding. We define the one-hot encoded target vector for the i-th observation as T_i.
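A small NumPy sketch of one-hot encoding (the helper name `one_hot` and the example labels are our own, not from the article):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Build a one-hot target matrix: row i has a 1 at column labels[i]."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

y = np.array([3, 0, 9])   # three example digit labels
T = one_hot(y, 10)
print(T)                  # 3 x 10 matrix of zeros with a single 1 per row
```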
 Cost function
Now we need to define a cost function that compares the softmax probabilities with the one-hot encoded target vector for similarity. We use the concept of cross-entropy for this. Cross-entropy is a distance function that takes the probabilities computed by the softmax function and the one-hot encoded target matrix and calculates the distance between them. For the correct target classes, the distance values will be smaller, and for the wrong target classes, the distance values will be larger. We define the cross-entropy, D(S_i, T_i), for the i-th observation with softmax probability vector S_i and one-hot target vector T_i, as:

D(S_i, T_i) = - Σ_(j=1..k) T_ij log(S_ij)

And now the cost function, J, can be defined as the average cross-entropy:

J = (1/n) Σ_(i=1..n) D(S_i, T_i)
and the challenge is to minimize this cost function!
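The cross-entropy cost above can be sketched in NumPy as follows (the small `eps` guard against log(0) and the example numbers are our additions):

```python
import numpy as np

def cross_entropy_cost(S, T, eps=1e-12):
    """Average cross-entropy between softmax probabilities S and
    one-hot targets T, both of shape (n, k)."""
    return -np.mean(np.sum(T * np.log(S + eps), axis=1))

# Two observations, three classes: the first predicted confidently and
# correctly, the second confidently but wrongly.
S = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.80, 0.10]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
cost = cross_entropy_cost(S, T)
print(cost)  # dominated by the wrong second prediction (~1.2)
```

Note how the confidently wrong row contributes -log(0.1) ≈ 2.3 to the average, while the correct row contributes only -log(0.9) ≈ 0.1.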
 Gradient descent algorithm
To train our softmax model using gradient descent, we need to compute the derivatives

∂J/∂w_j and ∂J/∂b_j

which we then use to update the weights and biases in the direction opposite to the gradient:

w_j := w_j - α ∂J/∂w_j and b_j := b_j - α ∂J/∂b_j

for each class j ∈ {1, ..., k}, where α is the learning rate. Using this cost gradient, we iteratively update the weight matrix until we reach a given number of epochs (passes through the training set) or the desired cost threshold.
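Putting the pieces together, here is a NumPy sketch of the training loop on a toy two-class problem (the function names, hyperparameters, and toy data are our own; the gradient X^T (S - T) / n is the standard result of differentiating the average cross-entropy cost above):

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, T, lr=0.1, epochs=500):
    """Batch gradient descent for softmax regression.
    X: (n, m) features, T: (n, k) one-hot targets."""
    n, m = X.shape
    k = T.shape[1]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(m, k))  # small normal initialization
    b = np.zeros(k)
    for _ in range(epochs):
        S = softmax(X @ W + b)         # (n, k) probabilities
        grad_W = X.T @ (S - T) / n     # dJ/dW for average cross-entropy
        grad_b = (S - T).mean(axis=0)  # dJ/db
        W -= lr * grad_W               # step against the gradient
        b -= lr * grad_b
    return W, b

# Toy problem: two well-separated 2-D clusters, two classes.
X = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W, b = train_softmax(X, T)
preds = np.argmax(softmax(X @ W + b), axis=1)
print(preds)  # [0 0 1 1]
```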
Implementation
Now let’s implement softmax regression on the MNIST handwritten digit dataset using the TensorFlow library.
For a detailed look at TensorFlow , follow this tutorial:
Step 1: Import the dependencies

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
Step 2: Download data
TensorFlow allows you to automatically download and read MNIST data. Consider the code below. It will download and save the data to the MNIST_data folder in the current project directory and load it into the current program.
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Step 3: Understanding the data
Now we will try to understand the structure of the dataset.
The MNIST data is split into three parts: 55,000 training data points (mnist.train), 10,000 test data points (mnist.test), and 5,000 validation data points (mnist.validation).
Each image is 28 by 28 pixels, flattened into a one-dimensional array of 784 features. The number of class labels is 10. Each target label is already provided in one-hot encoded form.

print("Shape of feature matrix:", mnist.train.images.shape)
print("Shape of target matrix:", mnist.train.labels.shape)
print("One-hot encoding for 1st observation:", mnist.train.labels[0])
Output:
Shape of feature matrix: (55000, 784)
Shape of target matrix: (55000, 10)
One-hot encoding for 1st observation: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
Step 4: Define the computation graph
Now we create the computation graph.
Some important points to pay attention to:
 For the training data, we use a placeholder that will be fed at runtime with a training minibatch. The technique of using minibatches to train a model with gradient descent is called stochastic gradient descent.
In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function. In GD, you have to run through ALL the samples in your training set to do a single parameter update in a particular iteration. In SGD, on the other hand, you use ONLY ONE training sample, or a SUBSET of training samples, to do the update in a particular iteration. If you use a subset, it is called minibatch stochastic gradient descent. Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration in which you update the parameter values you are running through the complete training set. On the other hand, SGD will be faster because you use only a small part of the training data, and it starts improving the parameters right away from the first samples.
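The minibatch scheme described above can be sketched as follows (the helper name `minibatch_indices` and the toy sizes are our own):

```python
import numpy as np

def minibatch_indices(n, batch_size, rng):
    """Yield index arrays covering one shuffled pass (epoch) over n samples."""
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(10, 4, rng))
print([len(batch) for batch in batches])  # [4, 4, 2], last batch is smaller
```

Each epoch, every training sample appears in exactly one minibatch, so the gradient updates still see the whole training set over time, just in small pieces.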