This article covers the basics of Softmax regression and how it is implemented in Python using the TensorFlow library.
What is Softmax Regression?
Softmax Regression (or Multinomial Logistic Regression) is a generalization of logistic regression to the case where we want to handle multiple classes.
A short introduction to logistic regression can be found here:
Understanding Logistic Regression
In binary logistic regression, we assumed that the labels were binary, i.e. for the $i^{th}$ observation, $y_i \in \{0, 1\}$.
But consider a scenario in which we need to classify an observation into one of two or more class labels, for example, digit classification. The possible labels here are $y_i \in \{0, 1, 2, \ldots, 9\}$.

In such cases, we can use Softmax Regression.
Let us first define our model:

- Let the dataset have $m$ features and $n$ observations. Also, let there be $k$ class labels, i.e. each observation can be classified as one of the $k$ possible target values.
- Let $X$ be the feature matrix, where $x_{ij}$ denotes the value of the $j^{th}$ feature for the $i^{th}$ observation. The matrix $X$ has dimensions $n \times (m+1)$.
- Let $W$ be the weight matrix, where $w_{ij}$ represents the weight assigned to the $i^{th}$ feature for the $j^{th}$ class label. The matrix $W$ has dimensions $(m+1) \times k$. Initially, the weight matrix is filled using some normal distribution.

Note that we are taking an extra column in the feature matrix, $x_{i0} = 1$, and an extra row in the weight matrix, $w_{0j}$. These extra columns and rows correspond to the bias terms associated with each prediction. This could instead be handled by defining a separate bias matrix of size $n \times k$. (In practice, all we need is a vector of size $k$ and some broadcasting for the bias terms!)

So, the final score matrix is:

$$Z = XW$$

$X$ has dimensions $n \times (m+1)$ while $W$ has dimensions $(m+1) \times k$, so $Z$ has dimensions $n \times k$. Whether the bias terms are folded into $X$ and $W$ or kept separate, the matrix $Z$ still has the same values and dimensions!
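As a quick illustration of these shapes (a NumPy sketch with made-up sizes, not part of the original tutorial), the score matrix is just a matrix product plus a broadcast bias vector:

```python
import numpy as np

n, m, k = 5, 784, 10          # observations, features, classes (illustrative sizes)

rng = np.random.default_rng(0)
X = rng.normal(size=(n, m))   # feature matrix: one row per observation
W = rng.normal(size=(m, k))   # weight matrix: one column per class label
b = np.zeros(k)               # bias vector of size k, broadcast over all rows

Z = X @ W + b                 # score (logit) matrix
print(Z.shape)                # (5, 10): one score per (observation, class) pair
```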
But what does the matrix $Z$ mean? Here, $z_{ij}$ is the likelihood of the $j^{th}$ class label for the $i^{th}$ observation. This is not a valid probability value, but it can be viewed as a score assigned to each class label for each observation!
Let us define $Z_i$ as the logit score vector for the $i^{th}$ observation.

For example, let the vector $Z_i$ represent the scores for each of the 10 class labels in the digit classification task for the $i^{th}$ observation. Here, the maximum score is 5.2, which corresponds to the class label "3". Therefore, our model currently predicts the $i^{th}$ observation/image as "3".
For the vector $Z_i$, the softmax function is defined as:

$$S(Z_i)_j = \frac{e^{z_{ij}}}{\sum_{l=1}^{k} e^{z_{il}}} \quad \text{for } j = 1, \ldots, k$$
So the softmax function will do 2 things:

1. Convert all scores to probabilities.
2. Make sure the probabilities sum to 1.
Recall that in the binary logistic classifier, we used the sigmoid function for the same task. The softmax function is nothing more than a generalization of the sigmoid function! This softmax function computes the probability that the $i^{th}$ training sample belongs to class $j$, given the logits vector $Z_i$, as:

$$P(y_i = j \mid Z_i) = S(Z_i)_j$$
In vector form, we can simply write:

$$P(y_i \mid Z_i) = S(Z_i)$$

For simplicity, let $S_i$ denote the softmax probability vector for the $i^{th}$ observation.
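As a small sketch of the softmax computation (NumPy, with a made-up logit vector; the max-subtraction trick is a standard guard against overflow, not something the article discusses):

```python
import numpy as np

def softmax(z):
    """Softmax of a logit vector; subtracting max(z) avoids overflow for large scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# A hypothetical logit vector Z_i for the 10-class digit task,
# with its largest score (5.2) at index 3:
Z_i = np.array([0.5, 1.2, -0.3, 5.2, 0.1, 0.9, -1.0, 0.0, 1.1, 0.4])
S_i = softmax(Z_i)
print(S_i.argmax())   # 3 -- the highest logit keeps the highest probability
print(S_i.sum())      # ~1.0 -- the scores now form a probability distribution
```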
Now we define the one-hot encoding for the $i^{th}$ observation as the vector $T_i$, where $t_{ij} = 1$ if the $i^{th}$ observation belongs to class $j$, and $t_{ij} = 0$ otherwise.
And now, the cost function $J$ can be defined as the average cross entropy, i.e.

$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} t_{ij} \log(S_{ij})$$

and the task is to minimize this cost function!
Gradient descent: to minimize the cost, we compute the gradient of the cost function for each class,

$$\frac{\partial J(W)}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} (S_{ij} - t_{ij})\, x_i$$

which we then use to update the weights and biases in the opposite direction of the gradient:

$$w_j := w_j - \alpha \frac{\partial J(W)}{\partial w_j}$$

for each class $j \in \{1, \ldots, k\}$, where $\alpha$ is the learning rate. Using this cost gradient, we iteratively update the weight matrix until we reach a certain number of epochs (passes through the training set) or the desired cost threshold.
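Putting the pieces together, the whole update loop can be sketched in NumPy (random data, illustrative sizes and learning rate; this is a toy sketch of the math above, not the TensorFlow code that follows):

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, k = 100, 20, 3                  # observations, features, classes (illustrative)
alpha = 0.5                           # learning rate

X = rng.normal(size=(n, m))           # random features
y = rng.integers(0, k, size=n)        # random class labels
T = np.eye(k)[y]                      # one-hot targets, shape (n, k)

W = np.zeros((m, k))                  # weights
b = np.zeros(k)                       # biases

for _ in range(200):                  # fixed number of epochs
    Z = X @ W + b                     # score matrix
    Z -= Z.max(axis=1, keepdims=True) # stabilize the exponentials
    S = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # row-wise softmax
    cost = -np.mean(np.sum(T * np.log(S), axis=1))        # average cross entropy
    W -= alpha * (X.T @ (S - T)) / n  # step opposite the weight gradient
    b -= alpha * (S - T).mean(axis=0) # step opposite the bias gradient

print(cost)  # lower than the initial cost of log(k)
```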
Now let's implement Softmax Regression on the MNIST handwritten digit dataset using the TensorFlow library.
For a detailed look at TensorFlow, follow this tutorial.

Step 1: Import dependencies
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
Step 2: Download data
TensorFlow allows you to automatically download and read MNIST data. Consider the code below. It will download and save the data to the MNIST_data folder in the current project directory and load it into the current program.
Extracting MNIST_data / train -images-idx3-ubyte.gz Extracting MNIST_data / train-labels-idx1-ubyte.gz Extracting MNIST_data / t10k-images-idx3-ubyte.gz Extracting MNIST_data / t10k-labels-idx1-ubyte.gz
Step 3: Understanding the data
Now we will try to understand the structure of the dataset.
The MNIST data is divided into three parts: 55,000 training data points ( mnist.train ), 10,000 test data points ( mnist.test ) and 5,000 validation data points ( mnist.validation ).
Each image is 28 by 28 pixels, flattened into a one-dimensional array of size 784. The number of class labels is 10. Each target label is already provided in one-hot encoded form.
Shape of feature matrix: (55000, 784)
Shape of target matrix: (55000, 10)
One-hot encoding for 1st observation: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
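The one-hot row printed above has its 1 at index 7, meaning the first training image is the digit 7. As a standalone NumPy sketch (hypothetical labels, no dataset download), such encodings can be built from an identity matrix:

```python
import numpy as np

labels = np.array([7, 3, 0])     # hypothetical digit labels
one_hot = np.eye(10)[labels]     # row i is the one-hot encoding of labels[i]
print(one_hot[0])                # [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```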
Step 4: Define the computation graph
Now we create the computation graph.