
# Softmax regression using TensorFlow


This article covers the basics of Softmax regression and how it is implemented in Python using the TensorFlow library.

## What is Softmax Regression?

Softmax Regression (or Multinomial Logistic Regression) is a generalization of logistic regression to the case where we want to handle multiple classes.

A short introduction to logistic regression can be found here:
Understanding Logistic Regression

In binary logistic regression, we assumed that the labels were binary, i.e. for the i-th observation,

y^(i) ∈ {0, 1}

But consider a scenario in which we need to classify an observation into one of two or more class labels, for example digit classification. The possible labels here are:

y^(i) ∈ {0, 1, 2, …, 9}

In such cases, we can use Softmax Regression.

Let’s define our model first:

• Let the dataset have "m" features and "n" observations. In addition, there are "k" class labels, i.e. each observation can be classified as one of "k" possible targets. For example, if we have a dataset of 100 handwritten digit images of size 28 × 28 to classify the digits, then n = 100, m = 28 × 28 = 784 and k = 10.
• Feature matrix
The feature matrix, X, is represented as:

X = [ x^(1) ; x^(2) ; … ; x^(n) ],  where x^(i) = ( 1, x_1^(i), x_2^(i), …, x_m^(i) )

Here, x_j^(i) denotes the value of the j-th feature for the i-th observation. The matrix X has dimensions n × (m + 1).

• Weight matrix
We define the weight matrix, W, as:

W = [ w_ij ],  i = 0, 1, …, m,  j = 1, 2, …, k

Here, w_ij represents the weight assigned to the i-th feature for the j-th class label. The matrix W has dimensions (m + 1) × k. Initially, the weight matrix is filled with values drawn from a normal distribution.

• Logit score matrix
We then define our net input matrix (also called the logit score matrix), Z, as:

Z = X W

The matrix Z has dimensions n × k.

Currently, we are taking an extra column in the feature matrix, X, and an extra row in the weight matrix, W. These extra columns and rows correspond to the bias terms associated with each prediction. This could instead be expressed by defining an extra bias matrix, B, of size n × k, where B_ij = b_j. (In practice, all we need is a bias vector of size k and some broadcasting!)

So, the final score matrix, Z, is:

Z = X W + B

where X now has dimensions n × m and W has dimensions m × k. The matrix Z still has the same values and the same dimensions, n × k!
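The bias broadcasting mentioned above is easy to check with a minimal NumPy sketch (the sizes 4, 3 and 2 are arbitrary illustrative values, not from the article):

```python
import numpy as np

n, m, k = 4, 3, 2          # observations, features, classes (illustrative)
X = np.ones((n, m))        # feature matrix, shape (n, m)
W = np.ones((m, k))        # weight matrix, shape (m, k)
b = np.zeros(k)            # bias vector, shape (k,)

# broadcasting adds b to every row of XW, so an explicit
# n-by-k bias matrix B is never materialized
Z = X @ W + b
print(Z.shape)  # (4, 2)
```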

But what does the matrix Z mean? Its entry Z_ij is the likelihood of the j-th class label for the i-th observation. It is not a valid probability value, but it can be viewed as a score assigned to each class label for each observation!

Let Z^(i) denote the logit score vector for the i-th observation.

For example, let the vector Z^(i) represent the scores for each of the 10 class labels in the digit classification task for the i-th observation. Suppose the maximum score is 5.2 and it corresponds to the class label "3". Then our model currently predicts the i-th observation/image as a "3".
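This "pick the label with the highest score" step is just an argmax. A tiny sketch (the score values below are made up to match the example, with the maximum 5.2 at index 3):

```python
import numpy as np

# hypothetical logit scores for one observation, one per digit class 0-9
z = np.array([0.5, 1.1, -0.3, 5.2, 0.9, 2.0, -1.2, 0.1, 3.3, 0.8])

prediction = int(np.argmax(z))  # index of the highest score
print(prediction)  # 3
```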

• Softmax layer
It is harder to train the model on raw score values, since they are difficult to differentiate through when implementing gradient descent to minimize the cost function. So, we need some function that normalizes the logit scores and is also easily differentiable! To convert the score matrix Z into probabilities, we use the softmax function.

For a vector y of length k, the softmax function S(y) is defined as:

S(y)_i = exp(y_i) / ( exp(y_1) + exp(y_2) + … + exp(y_k) )

So the softmax function does two things:

1. it converts all scores to probabilities;
2. it makes the probabilities sum to 1.

Recall that in the binary logistic classifier we used the sigmoid function for the same task. The softmax function is nothing more than a generalization of the sigmoid function! It computes the probability that the i-th training sample belongs to class j, given the logit vector Z^(i), as:

P(y^(i) = j | Z^(i)) = exp(Z_j^(i)) / Σ_l exp(Z_l^(i))

In vector form, we can simply write:

S^(i) = softmax(Z^(i))

For simplicity, let S^(i) denote the softmax probability vector for the i-th observation.
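The softmax function itself is a few lines of NumPy. A minimal sketch (the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
s = softmax(z)
print(np.isclose(s.sum(), 1.0))       # True: probabilities sum to 1
print(np.argmax(s) == np.argmax(z))   # True: the ordering of scores is preserved
```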

• One-hot encoded target matrix
Since the softmax function gives us a vector of probabilities of each class label for a given observation, we need to convert the target vector into the same format in order to compute the cost function! For each observation there is a target vector (instead of a single target value!) consisting only of zeros and ones, where only the position of the correct label is set to 1. This technique is called one-hot encoding.

We now define T^(i), the one-hot encoded target vector for the i-th observation, as:

T_j^(i) = 1 if j is the correct label for the i-th observation, and 0 otherwise
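One-hot encoding is easy to sketch in NumPy (the helper name `one_hot` is our own, not a library function):

```python
import numpy as np

def one_hot(y, k):
    # y: integer class labels, k: number of classes
    T = np.zeros((len(y), k))
    T[np.arange(len(y)), y] = 1.0   # set a single 1 per row, at the label's index
    return T

labels = np.array([7, 0, 3])
T = one_hot(labels, 10)
print(T.shape)  # (3, 10)
```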

• Cost function
Now we need to define a cost function that compares the softmax probabilities and the one-hot encoded target vector for similarity. We use the concept of cross entropy for this. Cross entropy is a distance function that takes the probabilities computed by the softmax function and the one-hot encoded target matrix and calculates the distance between them. For the correct target classes the distance values are smaller, and for the wrong target classes they are larger. We define the cross entropy, D, for the i-th observation, with softmax probability vector S^(i) and one-hot target vector T^(i), as:

D(S^(i), T^(i)) = − Σ_j T_j^(i) · log S_j^(i)

And now the cost function, J, can be defined as the average cross entropy, i.e.

J = (1/n) Σ_i D(S^(i), T^(i))

and the task is to minimize this cost function!
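The average cross-entropy cost J can be sketched in NumPy as follows (the small `eps` guards against log(0) and is an implementation detail, not part of the definition; the probabilities and targets below are made-up examples):

```python
import numpy as np

def cross_entropy_cost(S, T, eps=1e-12):
    # S: softmax probabilities, shape (n, k); T: one-hot targets, shape (n, k)
    # average of the per-observation cross entropies D(S^(i), T^(i))
    return -np.mean(np.sum(T * np.log(S + eps), axis=1))

S = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
T = np.array([[1.0, 0.0, 0.0],   # each correct class already gets a high probability
              [0.0, 1.0, 0.0]])
cost = cross_entropy_cost(S, T)
print(cost)  # small, since the predictions agree with the targets
```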

To train our softmax model using gradient descent, we need to compute the derivatives

∂J/∂w_j and ∂J/∂b_j

which we then use to update the weights and biases in the direction opposite to the gradient:

w_j := w_j − α · ∂J/∂w_j and b_j := b_j − α · ∂J/∂b_j

for each class j ∈ {1, …, k}, where α is the learning rate. Using these cost gradients, we iteratively update the weight matrix until we reach a given number of epochs (passes through the training set) or the desired cost threshold.
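A full gradient-descent step can be sketched in NumPy. This uses the well-known closed form of the softmax cross-entropy gradient, ∂J/∂W = Xᵀ(S − T)/n, which the article does not derive; all sizes and the random data below are purely illustrative:

```python
import numpy as np

def softmax(Z):
    # row-wise softmax with the usual max-subtraction for stability
    e = np.exp(Z - Z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gradient_step(X, T, W, b, alpha=0.5):
    # one update of W and b against the gradient of the average cross entropy
    n = X.shape[0]
    S = softmax(X @ W + b)        # softmax probabilities, shape (n, k)
    dW = X.T @ (S - T) / n        # dJ/dW, closed form for softmax + cross entropy
    db = (S - T).mean(axis=0)     # dJ/db
    return W - alpha * dW, b - alpha * db

# tiny run on random data: the cost should fall below the
# uniform-guessing baseline of log(k)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))                  # n=20 observations, m=4 features
T = np.eye(3)[rng.integers(0, 3, size=20)]    # one-hot targets, k=3 classes
W, b = np.zeros((4, 3)), np.zeros(3)
for _ in range(100):
    W, b = gradient_step(X, T, W, b)
cost = -np.mean(np.sum(T * np.log(softmax(X @ W + b) + 1e-12), axis=1))
print(cost < np.log(3))  # True
```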

## Implementation

Now let’s implement Softmax regression on the MNIST handwritten digit dataset using the TensorFlow library.

For a detailed look at TensorFlow, follow this tutorial:

Step 1: Import the required libraries

```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
```

Step 2: Download and load the data

TensorFlow allows you to automatically download and read the MNIST data. Consider the code below: it downloads and saves the data to the MNIST_data folder in the current project directory and loads it into the current program.

```python
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
```

Output:

```
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
```

Step 3: Understanding the data

Now we will try to understand the structure of the dataset.

The MNIST data is divided into three parts: 55,000 training data points (mnist.train), 10,000 test data points (mnist.test) and 5,000 validation data points (mnist.validation).

Each image is 28 by 28 pixels, flattened into a one-dimensional array of size 784. The number of class labels is 10. Each target label is already provided in one-hot encoded form.

```python
print("Shape of feature matrix:", mnist.train.images.shape)
print("Shape of target matrix:", mnist.train.labels.shape)
print("One-hot encoding for 1st observation:", mnist.train.labels[0])

# visualize the data by plotting the first 100 images
fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(mnist.train.images[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()
```

Output:

```
Shape of feature matrix: (55000, 784)
Shape of target matrix: (55000, 10)
One-hot encoding for 1st observation: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
```

Step 4: Define the computation graph

Now we create the computation graph.

Some important points to pay attention to:

• For the training data, we use a placeholder that will be fed at run time with a mini-batch of training examples. The technique of using mini-batches to train a model with gradient descent is called stochastic gradient descent.

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters iteratively to minimize an error function. In GD, you must run through ALL the samples in your training set to do a single update of a parameter in a particular iteration; in SGD, on the other hand, you use ONLY ONE training sample, or a SUBSET of them, to do the update in a particular iteration. If you use a SUBSET, this is called minibatch stochastic gradient descent.

Thus, if the number of training samples is large, in fact very large, using gradient descent may take too long, because in every iteration, when you update the parameter values, you run through the complete training set. On the other hand, SGD is faster, because it uses only one training sample (or a small batch) and starts improving right away from the first sample.
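A minimal sketch of how a random mini-batch might be drawn on each iteration (the arrays here are random stand-ins for the MNIST tensors, and the loop body where the gradient update would go is elided):

```python
import numpy as np

rng = np.random.default_rng(42)
train_dataset = rng.normal(size=(1000, 784))    # stand-in for mnist.train.images
train_labels = rng.integers(0, 10, size=1000)   # stand-in for the integer labels
batch_size = 128

for step in range(3):
    # draw a fresh random mini-batch each iteration
    idx = rng.choice(len(train_dataset), size=batch_size, replace=False)
    batch_data = train_dataset[idx]
    batch_labels = train_labels[idx]
    # ...feed batch_data / batch_labels into one gradient-descent update...

print(batch_data.shape)  # (128, 784)
```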


```python
# number of features
num_features = 784
# number of target labels
num_labels = 10
# learning rate (alpha)
learning_rate = 0.05
# batch size
batch_size = 128
# number of epochs
num_steps = 5001

# input data
train_dataset = mnist.train.images
train_labels = mnist.train.labels
test_dataset = mnist.test.images
test_labels = mnist.test.labels
valid_dataset = mnist.validation.images
valid_labels = mnist.validation.labels

# initialize tensorflow graph
graph = tf.Graph()

with graph.as_default():
    """
    defining all the nodes
    """
    # Inputs
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, num_features))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables.
    weights = tf.Variable(tf.truncated_normal([num_features, num_labels]))
    biases = tf.Variable(tf.zeros([num_labels]))

    # Training computation.
    logits = tf.matmul(tf_train_dataset, weights) + biases
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf_train_labels, logits=logits))

    # Optimizer.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
    test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)
```