StackOverflow

### Answer rating: 158

### Answer rating: 126

### Answer rating: 61

From the Udacity"s deep learning class, the softmax of y_i is simply the exponential divided by the sum of exponential of the whole Y vector:

Where `S(y_i)`

is the softmax function of `y_i`

and `e`

is the exponential and `j`

is the no. of columns in the input vector Y.

I"ve tried the following:

```
import numpy as np
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
scores = [3.0, 1.0, 0.2]
print(softmax(scores))
```

which returns:

```
[ 0.8360188 0.11314284 0.05083836]
```

But the suggested solution was:

```
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
return np.exp(x) / np.sum(np.exp(x), axis=0)
```

which produces the **same output as the first implementation**, even though the first implementation explicitly takes the difference of each column and the max and then divides by the sum.

**Can someone show mathematically why? Is one correct and the other one wrong?**

**Are the implementation similar in terms of code and time complexity? Which is more efficient?**

They"re both correct, but yours is preferred from the point of view of numerical stability.

You start with

```
e ^ (x - max(x)) / sum(e^(x - max(x))
```

By using the fact that a^(b - c) = (a^b)/(a^c) we have

```
= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))
= e ^ x / sum(e ^ x)
```

Which is what the other answer says. You could replace max(x) with any variable and it would cancel out.

(Well... much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are **not** equivalent; they **happen** to be equivalent only for the special case of 1-D score arrays. You would have discovered it if you had tried also the 2-D score array in the Udacity quiz provided example.

Results-wise, the only actual difference between the two solutions is the `axis=0`

argument. To see that this is the case, let"s try your solution (`your_softmax`

) and one where the only difference is the `axis`

argument:

```
import numpy as np
# your solution:
def your_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
# correct solution:
def softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0) # only difference
```

As I said, for a 1-D score array, the results are indeed identical:

```
scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
print(softmax(scores))
# [ 0.8360188 0.11314284 0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True, True, True], dtype=bool)
```

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

```
scores2D = np.array([[1, 2, 3, 6],
[2, 4, 5, 6],
[3, 8, 7, 6]])
print(your_softmax(scores2D))
# [[ 4.89907947e-04 1.33170787e-03 3.61995731e-03 7.27087861e-02]
# [ 1.33170787e-03 9.84006416e-03 2.67480676e-02 7.27087861e-02]
# [ 3.61995731e-03 5.37249300e-01 1.97642972e-01 7.27087861e-02]]
print(softmax(scores2D))
# [[ 0.09003057 0.00242826 0.01587624 0.33333333]
# [ 0.24472847 0.01794253 0.11731043 0.33333333]
# [ 0.66524096 0.97962921 0.86681333 0.33333333]]
```

The results are different - the second one is indeed identical with the one expected in the Udacity quiz, where all columns indeed sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually for an implementation detail - the `axis`

argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

while here we want to sum row-wise, hence `axis=0`

. For a 1-D array, the sum of the (only) row and the sum of all the elements happen to be identical, hence your identical results in that case...

The `axis`

issue aside, your implementation (i.e. your choice to subtract the max first) is actually **better** than the suggested solution! In fact, it is the recommended way of implementing the softmax function - see here for the justification (numeric stability, also pointed out by some other answers here).

So, this is really a comment to desertnaut"s answer but I can"t comment on it yet due to my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. **However, desertnaut"s solution is also wrong.** The problem is that once he takes a 1-dimensional input and then he takes a 2-dimensional input. Let me show this to you.

```
import numpy as np
# your solution:
def your_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum()
# desertnaut solution (copied from his answer):
def desertnaut_softmax(x):
"""Compute softmax values for each sets of scores in x."""
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0) # only difference
# my (correct) solution:
def softmax(z):
assert len(z.shape) == 2
s = np.max(z, axis=1)
s = s[:, np.newaxis] # necessary step to do broadcasting
e_x = np.exp(z - s)
div = np.sum(e_x, axis=1)
div = div[:, np.newaxis] # dito
return e_x / div
```

Lets take desertnauts example:

```
x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)
```

This is the output:

```
your_softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
desertnaut_softmax(x1)
array([[ 1., 1., 1., 1.]])
softmax(x1)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
```

You can see that desernauts version would fail in this situation. (It would not if the input was just one dimensional like np.array([1, 2, 3, 6]).

Lets now use 3 samples since thats the reason why we use a 2 dimensional input. The following x2 is not the same as the one from desernauts example.

```
x2 = np.array([[1, 2, 3, 6], # sample 1
[2, 4, 5, 6], # sample 2
[1, 2, 3, 6]]) # sample 1 again(!)
```

This input consists of a batch with 3 samples. But sample one and three are essentially the same. We now expect 3 rows of softmax activations where the first should be the same as the third and also the same as our activation of x1!

```
your_softmax(x2)
array([[ 0.00183535, 0.00498899, 0.01356148, 0.27238963],
[ 0.00498899, 0.03686393, 0.10020655, 0.27238963],
[ 0.00183535, 0.00498899, 0.01356148, 0.27238963]])
desertnaut_softmax(x2)
array([[ 0.21194156, 0.10650698, 0.10650698, 0.33333333],
[ 0.57611688, 0.78698604, 0.78698604, 0.33333333],
[ 0.21194156, 0.10650698, 0.10650698, 0.33333333]])
softmax(x2)
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037047],
[ 0.01203764, 0.08894682, 0.24178252, 0.65723302],
[ 0.00626879, 0.01704033, 0.04632042, 0.93037047]])
```

I hope you can see that this is only the case with my solution.

```
softmax(x1) == softmax(x2)[0]
array([[ True, True, True, True]], dtype=bool)
softmax(x1) == softmax(x2)[2]
array([[ True, True, True, True]], dtype=bool)
```

Additionally, here is the results of TensorFlows softmax implementation:

```
import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})
```

And the result:

```
array([[ 0.00626879, 0.01704033, 0.04632042, 0.93037045],
[ 0.01203764, 0.08894681, 0.24178252, 0.657233 ],
[ 0.00626879, 0.01704033, 0.04632042, 0.93037045]], dtype=float32)
```

This book serves as a practical guide on how to utilize big data to store, process, and analyze structured data, focusing on three of the most popular Apache projects in the Hadoop ecosystem: Apache S...

10/07/2020

The field of Artificial Intelligence (AI), which can definitely be considered to be the parent field of deep learning, has a rich history going back to 1950. While we will not cover this history in mu...

23/09/2020

Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD. Deep learning is often seen as the exclusive domain of math PhDs and big tech companies. But as this how-to guide sh...

05/09/2021

Target knows. Apple Computer knows, too. So do LinkedIn, Netflix, Facebook, Twitter, Expedia, national and local political campaigns, and dozens of other organizations that all generate enormous eco...

10/07/2020

X
# Submit new EBook