Softmax regression using TensorFlow



This article covers the basics of Softmax regression and how it is implemented in Python using the TensorFlow library.

What is Softmax Regression?

Softmax Regression (or Multinomial Logistic Regression) is a generalization of logistic regression to the case where we want to handle multiple classes.

A short introduction to logistic regression can be found here:
Understanding Logistic Regression

In binary logistic regression, we assumed that the labels were binary, i.e. for the i-th observation,

    y_i ∈ {0, 1}

But consider a scenario in which we need to classify an observation into one of more than two class labels, for example, digit classification. Here the possible labels are:

    y_i ∈ {0, 1, 2, ..., 9}

In such cases, we can use Softmax Regression.

Let's define our model first:

  • Let the dataset have “m” features and “n” observations. Also, there are “k” class labels, i.e. each observation can be classified as one of the “k” possible target values. For example, if we have a dataset of 100 handwritten digit images of size 28 × 28 for digit classification, then n = 100, m = 28 × 28 = 784 and k = 10.


  • Feature matrix
    The feature matrix, X, is represented as:

        X = [[1, x_11, x_12, ..., x_1m],
             [1, x_21, x_22, ..., x_2m],
             ...
             [1, x_n1, x_n2, ..., x_nm]]

    Here, x_ij denotes the value of the j-th feature for the i-th observation. The matrix has dimensions n × (m + 1).

  • Weight matrix
    We define the weight matrix, W, as:

        W = [[w_01, w_02, ..., w_0k],
             [w_11, w_12, ..., w_1k],
             ...
             [w_m1, w_m2, ..., w_mk]]

    Here, w_ij represents the weight assigned to the i-th feature for the j-th class label. The matrix has dimensions (m + 1) × k. Initially, the weight matrix is filled using some normal distribution.

  • Logit score matrix
    We then define our net input matrix (also called the logit score matrix), Z, as:

        Z = XW

    The matrix has dimensions n × k.

    Currently, we are taking an extra column in the feature matrix, X, and an extra row in the weight matrix, W. These extra columns and rows correspond to the bias terms associated with each prediction. This could be simplified by defining an extra matrix for the bias, b, of size n × k, where b_ij = w_0j. (In practice, all we need is a vector of size k and some broadcasting for the bias terms!)

    So the final score matrix, Z, is:

        Z = XW + b

    where the X matrix now has dimensions n × m while W has dimensions m × k. But the Z matrix still has the same values and dimensions!

    But what does the Z matrix signify? Actually, Z_ij is the likelihood of the j-th label for the i-th observation. It is not a valid probability value, but it can be viewed as a score assigned to each class label for each observation!

    Let us define Z_i as the logit score vector for the i-th observation.

    For example, suppose the vector Z_i holds the scores of each of the class labels in the digit classification task for the i-th observation, and the maximum score is 5.2, which corresponds to the class label “3”. Then our model currently predicts the i-th observation/image as “3”.

  • Softmax layer
    It is harder to train the model using the score values, since they are hard to differentiate while implementing the gradient descent algorithm to minimize the cost function. So, we need some function that normalizes the logit scores and also makes them easily differentiable! To convert the score matrix Z to probabilities, we use the Softmax function.

    For a vector y, the softmax function S(y) is defined as:

        S(y)_i = e^(y_i) / Σ_j e^(y_j)

    So the softmax function will do 2 things:

     1. convert all scores to probabilities.
     2. make the sum of all probabilities equal to 1.

    Recall that in the binary logistic classifier we used the sigmoid function for the same task. The softmax function is nothing but a generalization of the sigmoid function! Now, this softmax function computes the probability that the i-th training sample belongs to class j, given the logit vector Z_i, as:

        P(y = j | Z_i) = e^(Z_ij) / Σ_(l=1..k) e^(Z_il)

    In vector form, we can simply write:

        S(Z_i) = [P(y = 1 | Z_i), P(y = 2 | Z_i), ..., P(y = k | Z_i)]

    For simplicity, let S_i denote the softmax probability vector for the i-th observation.

  • One-hot encoded target matrix
    Since the softmax function provides us with a vector of probabilities of each class label for a given observation, we need to convert the target vector to the same format in order to compute the cost function! Corresponding to each observation, there is a target vector (instead of a target value!) consisting only of zeros and ones, where only the correct label is set to 1. This technique is called one-hot encoding. For example, in digit classification the one-hot target vector for the label “3” is [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

    Now we define the one-hot encoded target vector for the i-th observation as T_i.

  • Cost function
    Now we need to define a cost function, for which we must compare the softmax probabilities and the one-hot encoded target vector for similarity. We use the concept of cross-entropy for this. Cross-entropy is a distance function that takes the probabilities computed by the softmax function and the one-hot encoded target matrix and calculates the distance between them. For the correct target classes the distance values will be smaller, and for the wrong target classes the distance values will be larger. We define the cross-entropy, D(S_i, T_i), for the i-th observation with softmax probability vector S_i and one-hot target vector T_i as:

        D(S_i, T_i) = − Σ_(j=1..k) T_ij · log(S_ij)

    And now the cost function, J, can be defined as the average cross-entropy, i.e.

        J = (1/n) Σ_(i=1..n) D(S_i, T_i)

    and the task is to minimize this cost function!

  • Gradient descent algorithm
    To train our softmax model via gradient descent, we need to compute the derivatives

        ∂J/∂w_j and ∂J/∂b_j

    which we then use to update the weights and biases in the direction opposite to the gradient:

        w_j := w_j − α·∂J/∂w_j and b_j := b_j − α·∂J/∂b_j

    for each class j ∈ {1, ..., k}, where α is the learning rate. Using this cost gradient, we iteratively update the weight matrix until we reach a given number of epochs (passes through the training set) or the desired cost threshold. A small NumPy sketch of these computations is given after this list.
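To make the definitions above concrete, here is a minimal NumPy sketch of one pass through these computations. It is only an illustration, not part of the TensorFlow implementation below: the toy shapes and random data are made up, and the gradient expressions used are the standard softmax-regression derivatives, which this article does not derive explicitly.

import numpy as np

# toy setup: n = 4 observations, m = 3 features, k = 2 classes
n, m, k = 4, 3, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(n, m))            # feature matrix, n x m
T = np.eye(k)[np.array([0, 1, 1, 0])]  # one-hot target matrix, n x k
W = rng.normal(size=(m, k))            # weight matrix, m x k
b = np.zeros(k)                        # bias vector of size k, broadcast over rows

def softmax(Z):
    # subtract the row-wise maximum for numerical stability; each row sums to 1
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

Z = X @ W + b                  # logit score matrix, n x k
S = softmax(Z)                 # softmax probability matrix, n x k
predictions = S.argmax(axis=1) # predicted label = class with the highest score

# average cross-entropy cost: J = (1/n) * sum_i ( -sum_j T_ij * log(S_ij) )
J = -np.mean(np.sum(T * np.log(S), axis=1))

# gradients of J with respect to W and b (standard softmax-regression results)
dW = X.T @ (S - T) / n
db = (S - T).mean(axis=0)

# one gradient descent update with learning rate alpha
alpha = 0.5
W -= alpha * dW
b -= alpha * db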

Implementation

Now let's implement Softmax Regression on the MNIST handwritten digit dataset using the TensorFlow library.

For a detailed look at TensorFlow, follow this tutorial:

Step 1: Import the dependencies

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

Step 2: Download the data

TensorFlow allows you to automatically download and read the MNIST data. Consider the code below: it will download and save the data to the MNIST_data/ folder in the current project directory and load it into the current program.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

Output:

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz

Step 3: Understanding the data

Now we will try to understand the structure of the dataset.

The MNIST data is split into three parts: 55,000 training data points (mnist.train), 10,000 test data points (mnist.test), and 5,000 validation data points (mnist.validation).

Each image is 28 by 28 pixels, which has been flattened into a one-dimensional array of size 784. The number of class labels is 10. Each target label is already provided in one-hot encoded form.

    print ( "Shape of feature matrix:" , mnist.train.images.shape)

    print ( " Shape of target matrix: " , mnist.train.labels.shape)

    print ( "One-hot encoding for 1st observation:" , mnist.train.labels [ 0 ])

     
    # render data by rendering

    fig, ax = plt.subplots ( 10 , 10 )

    k = 0

    for i in range ( 10 ):

    for j in range ( 10 ):

      ax [i] [j] .imshow (mnist.train.images [k] .reshape ( 28 , 28 ), aspect = `auto` )

      k + = 1

    plt.show ()

Output:

Shape of feature matrix: (55000, 784)
Shape of target matrix: (55000, 10)
One-hot encoding for 1st observation: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]

Step 4: Define the computation graph

Now we create the computation graph.

Some important points to pay attention to:

• For the training data, we use a placeholder that will be fed at run time with a training minibatch. The technique of using minibatches to train a model with gradient descent is called Stochastic Gradient Descent.

  In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function. In GD, you have to run through ALL the samples in your training set to do a single parameter update in a particular iteration; in SGD, on the other hand, you use ONLY ONE training sample or a SUBSET of training samples from your training set to do the update in a particular iteration. If you use a SUBSET, it is called Minibatch Stochastic Gradient Descent. Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the parameter values, you are running through the complete training set. On the other hand, using SGD is faster because you use only one training sample, and it starts improving itself right away from the first sample. A sketch of such a minibatch training loop is given after the graph definition below.


# number of features
num_features = 784
# number of target labels
num_labels = 10
# learning rate (alpha)
learning_rate = 0.05
# batch size
batch_size = 128
# number of training steps
num_steps = 5001

# input data
train_dataset = mnist.train.images
train_labels = mnist.train.labels
test_dataset = mnist.test.images
test_labels = mnist.test.labels
valid_dataset = mnist.validation.images
valid_labels = mnist.validation.labels

# initialize a tensorflow graph
graph = tf.Graph()

with graph.as_default():
    """
    defining all the nodes
    """
    # Inputs: placeholders for the training minibatch, constants for the
    # validation and test sets
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, num_features))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables: weights initialized from a truncated normal distribution,
    # biases initialized with zeros
    weights = tf.Variable(tf.truncated_normal([num_features, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    # Training computation: logit scores and average cross-entropy loss
    logits = tf.matmul(tf_train_dataset, weights) + biases
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf_train_labels, logits=logits))

    # Optimizer: plain gradient descent on the loss
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Predictions for the training, validation, and test data
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)
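The graph above only defines the computation. To actually run the minibatch SGD described in the notes above, one would open a session and repeatedly feed random minibatches into the placeholders. The following is only a rough sketch under that assumption: the reporting interval, the random minibatch selection, and the accuracy() helper are illustrative choices, not part of the graph definition above.

# hypothetical helper: percentage of predictions whose argmax matches the one-hot label
def accuracy(predictions, labels):
    return 100.0 * np.mean(np.argmax(predictions, 1) == np.argmax(labels, 1))

with tf.Session(graph=graph) as session:
    # initialize the weights and biases
    tf.global_variables_initializer().run()

    for step in range(num_steps):
        # pick a random minibatch of training data
        offset = np.random.randint(0, train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]

        # feed the minibatch into the placeholders and run one SGD step
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)

        if step % 500 == 0:
            print("Minibatch loss at step {}: {}".format(step, l))
            print("Minibatch accuracy: {:.1f}%".format(accuracy(predictions, batch_labels)))
            print("Validation accuracy: {:.1f}%".format(
                accuracy(valid_prediction.eval(), valid_labels)))

    print("Test accuracy: {:.1f}%".format(accuracy(test_prediction.eval(), test_labels)))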