How to implement the Softmax function in Python?


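Question:

From Udacity's deep learning class, the softmax of y_i is simply the exponential divided by the sum of the exponentials of the whole Y vector:

S(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}

where S(y_i) is the softmax function of y_i, e is the exponential, and j runs over the columns of the input vector Y.

I've tried the following: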

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))

[ 0.8360188   0.11314284  0.05083836]

But the suggested solution was:

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

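It produces the same output as the first implementation, even though the first implementation explicitly takes the difference between each column and the max and then divides by the sum.

Can someone show mathematically why? Is one correct and the other one wrong?

Are the implementations similar in terms of code and time complexity? Which is more efficient?

Solution 1:

They are both correct, but yours is preferred from the point of view of numerical stability.

You start with

e ^ (x - max(x)) / sum(e ^ (x - max(x)))

By using the fact that a^(b - c) = (a^b)/(a^c), we have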

= e ^ x / (e ^ max(x) * sum(e ^ x / e ^ max(x)))

= e ^ x / sum(e ^ x)

Which is what the other answer says. You could replace max(x) with any variable and it would cancel out.
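As a quick numerical check of the cancellation argued above, the sketch below shifts the scores from the question by a few arbitrary constants and compares the results. The helper name softmax_shifted is hypothetical and only serves this illustration.

```python
import numpy as np

def softmax_shifted(x, c):
    """Softmax of a 1-D array after subtracting an arbitrary constant c."""
    e_x = np.exp(np.asarray(x, dtype=float) - c)
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
reference = softmax_shifted(scores, 0.0)        # no shift at all
for c in (np.max(scores), -2.5, 10.0):          # any constant cancels out
    assert np.allclose(softmax_shifted(scores, c), reference)
print(reference)
```

The choice c = max(x) is special only because it keeps every exponent at or below zero, which is what makes it the numerically safe option.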

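Solution 2:

(Well... much confusion here, both in the question and in the answers...)

To start with, the two solutions (i.e. yours and the suggested one) are not equivalent; they happen to be equivalent only for the special case of 1-D score arrays. You would have discovered this if you had also tried the 2-D score array in the example provided by the Udacity quiz.

Results-wise, the only actual difference between the two solutions is the axis=0 argument. To see that this is the case, let's try your solution (your_softmax) and one where the only difference is the axis argument: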

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

As I said, for a 1-D score array, the results are indeed identical:

scores = [3.0, 1.0, 0.2]
print(your_softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
print(softmax(scores))
# [ 0.8360188   0.11314284  0.05083836]
your_softmax(scores) == softmax(scores)
# array([ True,  True,  True], dtype=bool)

Nevertheless, here are the results for the 2-D score array given in the Udacity quiz as a test example:

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(your_softmax(scores2D))
# [[  4.89907947e-04   1.33170787e-03   3.61995731e-03   7.27087861e-02]
#  [  1.33170787e-03   9.84006416e-03   2.67480676e-02   7.27087861e-02]
#  [  3.61995731e-03   5.37249300e-01   1.97642972e-01   7.27087861e-02]]

print(softmax(scores2D))
# [[ 0.09003057  0.00242826  0.01587624  0.33333333]
#  [ 0.24472847  0.01794253  0.11731043  0.33333333]
#  [ 0.66524096  0.97962921  0.86681333  0.33333333]]

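The results are different - the second one is indeed identical to the one expected in the Udacity quiz, where all columns sum to 1, which is not the case with the first (wrong) result.

So, all the fuss was actually about an implementation detail - the axis argument. According to the numpy.sum documentation:

The default, axis=None, will sum all of the elements of the input array

whereas here we want to sum along axis 0, i.e. down each column, hence axis=0. For a 1-D array, the sum along the only axis and the sum of all the elements happen to be identical, hence your identical results in that case...

The axis issue aside, your implementation (i.e. your choice to subtract the max first) is actually better than the suggested solution! In fact, it is the recommended way of implementing the softmax function; the justification is numerical stability, which some other answers here also point out.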
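To make the role of the axis argument concrete, here is a small sketch that sums the quiz array from this answer in the three possible ways. Only np.sum behaviour is shown; the variable names are chosen for this illustration.

```python
import numpy as np

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

total = scores2D.sum()             # axis=None: one scalar over all 12 entries
per_column = scores2D.sum(axis=0)  # collapses the rows: one sum per column
per_row = scores2D.sum(axis=1)     # collapses the columns: one sum per row

print(total)       # 53
print(per_column)  # [ 6 14 15 18]
print(per_row)     # [12 17 24]
```

Dividing by the scalar normalizes the whole array at once, while dividing by the per-column sums normalizes each column separately, which is why the two functions agree only for 1-D input.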

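Solution 3:

So, this is really a comment on desertnaut's answer, but I can't comment on it yet because of my reputation. As he pointed out, your version is only correct if your input consists of a single sample. If your input consists of several samples, it is wrong. However, desertnaut's solution is also wrong. The problem is that he first uses a 1-dimensional input and then switches to a 2-dimensional one. Let me show this to you.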

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# desertnaut solution (copied from his answer): 
def desertnaut_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

# my (correct) solution:
def softmax(z):
    assert len(z.shape) == 2
    s = np.max(z, axis=1)
    s = s[:, np.newaxis] # necessary step to do broadcasting
    e_x = np.exp(z - s)
    div = np.sum(e_x, axis=1)
    div = div[:, np.newaxis] # ditto
    return e_x / div

Let's take desertnaut's example:

x1 = np.array([[1, 2, 3, 6]]) # notice that we put the data into 2 dimensions(!)

This is the output:

your_softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

desertnaut_softmax(x1)
array([[ 1.,  1.,  1.,  1.]])

softmax(x1)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

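You can see that desertnaut's version fails in this situation. (It would not fail if the input were just one-dimensional, like np.array([1, 2, 3, 6]).)

Let's now use 3 samples, since that is the reason we use a 2-dimensional input. The following x2 is not the same as the one in desertnaut's example.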

x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)

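This input consists of a batch with 3 samples. But samples one and three are essentially the same. We now expect 3 rows of softmax activations, where the first should be the same as the third and also the same as our activation of x1!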

your_softmax(x2)
array([[ 0.00183535,  0.00498899,  0.01356148,  0.27238963],
       [ 0.00498899,  0.03686393,  0.10020655,  0.27238963],
       [ 0.00183535,  0.00498899,  0.01356148,  0.27238963]])


desertnaut_softmax(x2)
array([[ 0.21194156,  0.10650698,  0.10650698,  0.33333333],
       [ 0.57611688,  0.78698604,  0.78698604,  0.33333333],
       [ 0.21194156,  0.10650698,  0.10650698,  0.33333333]])

softmax(x2)
array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037047],
       [ 0.01203764,  0.08894682,  0.24178252,  0.65723302],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037047]])

I hope you can see that this is only the case with my solution.

softmax(x1) == softmax(x2)[0]
array([[ True,  True,  True,  True]], dtype=bool)

softmax(x1) == softmax(x2)[2]
array([[ True,  True,  True,  True]], dtype=bool)

In addition, here are the results of TensorFlow's softmax implementation:

import tensorflow as tf
import numpy as np
batch = np.asarray([[1,2,3,6],[2,4,5,6],[1,2,3,6]])
x = tf.placeholder(tf.float32, shape=[None, 4])
y = tf.nn.softmax(x)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(y, feed_dict={x: batch})

And the result:

array([[ 0.00626879,  0.01704033,  0.04632042,  0.93037045],
       [ 0.01203764,  0.08894681,  0.24178252,  0.657233  ],
       [ 0.00626879,  0.01704033,  0.04632042,  0.93037045]], dtype=float32)

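Solution 4:

I would say that while both are correct mathematically, the first one is better implementation-wise. When computing softmax, the intermediate values may become very large, and dividing two large numbers can be numerically unstable. These notes (from Stanford) mention a normalization trick, which is essentially what you are doing.

Solution 5:

From a mathematical point of view, both sides are equal.

You can easily prove this. Let m = max(x). Now your function softmax returns a vector whose i-th coordinate is equal to

\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i} / e^m}{\sum_j e^{x_j} / e^m} = \frac{e^{x_i}}{\sum_j e^{x_j}}

Notice that this works for any m, because e^m != 0 for all (even complex) numbers.

From a computational complexity point of view, they are also equivalent: both run in O(n) time, where n is the size of the vector.
From a numerical stability point of view, the first solution is preferred, because e^x grows very fast and overflows even for moderately large x (np.exp returns inf above roughly x = 709 in float64). Subtracting the maximum value gets rid of this overflow. To experience this in practice, try feeding x = np.array([1000, 5]) into both of your functions. One will return the correct probabilities; the other will overflow and return nan.
Your solution works only for vectors (the Udacity quiz wants you to calculate it for matrices as well). To fix it, you need to use sum(axis=0).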
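The experiment suggested above can be sketched as follows. The names softmax_naive and softmax_stable are hypothetical labels for the two variants from the question, and np.errstate is used only to keep the expected overflow warnings quiet.

```python
import numpy as np

def softmax_naive(x):
    """Softmax without the max shift; np.exp overflows for large inputs."""
    e_x = np.exp(x)
    return e_x / e_x.sum()

def softmax_stable(x):
    """Softmax with the max subtracted before exponentiating."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([1000.0, 5.0])

with np.errstate(over="ignore", invalid="ignore"):  # silence the RuntimeWarnings
    naive = softmax_naive(x)

stable = softmax_stable(x)
print(naive)   # first entry is nan, because it is computed as inf / inf
print(stable)  # [1. 0.]
```

In the stable version the largest exponent is exactly 0, so the worst that can happen is that very small terms underflow to 0, which is harmless here.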

Solution 6:

EDIT. As of version 1.2.0, scipy includes softmax as a special function:

https://scipy.github.io/devdocs/generated/scipy.special.softmax.html

I wrote a function applying the softmax over any axis:

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats. 
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the 
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter, 
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p

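As other users have said, subtracting the max is good practice. I wrote a detailed post about it here.

Solution 7:

Here you can find out why they used - max.

From there:

"When you're writing code for computing the Softmax function in practice, the intermediate terms may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick."

Solution 8:

I was curious to see the performance difference between these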

import numpy as np

def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

def softmaxv3(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x, axis=0)

def softmaxv4(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)), axis=0)



x=[10,10,18,9,15,3,1,2,1,10,10,10,8,15]

Using

print("----- softmax")
%timeit  a=softmax(x)
print("----- softmaxv2")
%timeit  a=softmaxv2(x)
print("----- softmaxv3")
%timeit  a=softmaxv3(x)
print("----- softmaxv4")
%timeit  a=softmaxv4(x)

Increasing the values inside x (+100 +200 +500...), I get consistently better results with the original numpy version (here is just one test)

----- softmax
The slowest run took 8.07 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 17.8 µs per loop
----- softmaxv2
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv3
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23 µs per loop
----- softmaxv4
10000 loops, best of 3: 23 µs per loop

Until... the values inside x reach ~800, then I get

----- softmax
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: overflow encountered in exp
  after removing the cwd from sys.path.
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:4: RuntimeWarning: invalid value encountered in true_divide
  after removing the cwd from sys.path.
The slowest run took 18.41 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv2
The slowest run took 4.18 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.8 µs per loop
----- softmaxv3
The slowest run took 19.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.6 µs per loop
----- softmaxv4
The slowest run took 16.82 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 22.7 µs per loop

As some said, your version is more numerically stable "for large numbers". For small numbers it could be the other way round.
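The %timeit lines above only work inside IPython or Jupyter. A rough equivalent with the standard-library timeit module might look like the sketch below; the loop counts are arbitrary choices, and the timings will differ from machine to machine, so no numbers are quoted here.

```python
import timeit
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x), axis=0)

def softmaxv2(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

x = np.array([10, 10, 18, 9, 15, 3, 1, 2, 1, 10, 10, 10, 8, 15], dtype=float)

results = {}
for fn in (softmax, softmaxv2):
    # best of 3 repeats, 1000 calls each, reported per call in microseconds
    best = min(timeit.repeat(lambda: fn(x), number=1000, repeat=3))
    results[fn.__name__] = best / 1000 * 1e6
    print(f"{fn.__name__}: {results[fn.__name__]:.1f} µs per call")
```

For an input this small the differences are a few microseconds at most, so the stability of the shifted version is usually the more important consideration.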

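Solution 9:

To offer an alternative solution, consider the cases where your arguments are extremely large in magnitude, such that exp(x) would underflow (in the negative case) or overflow (in the positive case). Here you want to remain in log space as long as possible, exponentiating only at the end, where you can trust that the result will be well-behaved.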

import scipy.special as sc
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    return np.exp(x - sc.logsumexp(x))
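The function above reduces over the whole array. As a sketch of the same log-space idea for batched input, the version below uses only numpy and normalizes along one axis; the names log_softmax and softmax_from_logs and the axis parameter are assumptions made for this illustration, not part of the answer.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Log of the softmax along `axis`, computed without leaving log space."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)
    # log-sum-exp of the shifted values; the largest term is exp(0) = 1
    lse = np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))
    return shifted - lse

def softmax_from_logs(x, axis=-1):
    """Exponentiate only at the very end."""
    return np.exp(log_softmax(x, axis=axis))

batch = np.array([[1000.0, 5.0],
                  [-1000.0, -1005.0]])
probs = softmax_from_logs(batch)
print(probs.sum(axis=-1))  # each row sums to 1
```

Keeping log_softmax as a separate function can be handy when a later step, such as a cross-entropy loss, needs the log-probabilities directly.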

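Solution 10:

I needed something compatible with the output of a dense layer from Tensorflow.

The solution from @desertnaut does not work in this case because I have batches of data. Therefore, I came up with another solution that should work in both cases: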

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x)) # same code
    return e_x / e_x.sum(axis=axis, keepdims=True)

Results:

logits = np.asarray([
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921], # 1
    [-0.0052024,  -0.00770216,  0.01360943, -0.008921]  # 2
])

print(softmax(logits))

#[[0.2492037  0.24858153 0.25393605 0.24827873]
# [0.2492037  0.24858153 0.25393605 0.24827873]]

Reference: Tensorflow softmax

Solution 11:

A more concise version is:

def softmax(x):
    return np.exp(x) / np.exp(x).sum(axis=0)

Solution 12:

I would suggest this:

def softmax(z):
    z_norm=np.exp(z-np.max(z,axis=0,keepdims=True))
    return(np.divide(z_norm,np.sum(z_norm,axis=0,keepdims=True)))

It will work for stochastic as well as batch inputs.

For more detail see:
https://medium.com/@ravish1729/analysis-of-softmax-function-ad058d6a564d

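Solution 13:

To maintain numerical stability, max(x) should be subtracted. The following is the code for the softmax function:

import numpy as np

def softmax(x):
    x = np.array(x, dtype=float)  # work on a float copy; the caller's array is untouched
    if len(x.shape) > 1:
        tmp = np.max(x, axis = 1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis = 1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    return x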

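Solution 14:

This has already been answered in much detail above. The max is subtracted to avoid overflow. I am adding one more implementation in python3 here.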

import numpy as np
def softmax(x):
    mx = np.amax(x,axis=1,keepdims = True)
    x_exp = np.exp(x - mx)
    x_sum = np.sum(x_exp, axis = 1, keepdims = True)
    res = x_exp / x_sum
    return res

x = np.array([[3,2,4],[4,5,6]])
print(softmax(x))

Solution 15:

Everybody seems to post their solution, so I'll post mine:

def softmax(x):
    e_x = np.exp(x.T - np.max(x, axis = -1))
    return (e_x / e_x.sum(axis=0)).T

I get exactly the same results as with the one imported from sklearn:

from sklearn.utils.extmath import softmax

Solution 16:

import tensorflow as tf
import numpy as np

def softmax(x):
    return (np.exp(x).T / np.exp(x).sum(axis=-1)).T

logits = np.array([[1, 2, 3], [3, 10, 1], [1, 2, 5], [4, 6.5, 1.2], [3, 6, 1]])

sess = tf.Session()
print(softmax(logits))
print(sess.run(tf.nn.softmax(logits)))
sess.close()

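Solution 17:

Based on all the responses and the CS231n notes, allow me to summarise:

def softmax(x, axis):
    # subtract the max without modifying the caller's array in place
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

Usage: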

x = np.array([[1, 0, 2,-1],
              [2, 4, 6, 8], 
              [3, 2, 1, 0]])
softmax(x, axis=1).round(2)

Output:

array([[0.24, 0.09, 0.64, 0.03],
       [0.  , 0.02, 0.12, 0.86],
       [0.64, 0.24, 0.09, 0.03]])

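Solution 18:

The softmax function is an activation function that turns numbers into probabilities that sum to one. It outputs a vector that represents the probability distribution over a list of outcomes. It is also a core element used in deep learning classification tasks.

The softmax function is used when we have multiple classes.

It is useful for finding the class that has the maximum probability.

The softmax function is ideally used in the output layer, where we are actually trying to obtain the probabilities that define the class of each input.

Its outputs range from 0 to 1.

The softmax function turns the logits [2.0, 1.0, 0.1] into the probabilities [0.7, 0.2, 0.1] (rounded), and the probabilities sum to 1. Logits are the raw scores output by the last layer of a neural network, before activation takes place. To understand the softmax function, we must look at the output of the (n-1)th layer.

The softmax function is, in effect, a smooth version of the arg max function: rather than returning the largest value of the input, it puts most of the probability mass at the position of the largest value.

For example:

Before softmax

X = [13, 31, 5]

After softmax

array([1.52299795e-08, 9.99999985e-01, 5.10908895e-12])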
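The relationship to arg max in the example above can be checked with a short sketch. It assumes the max-shifted softmax used elsewhere in this thread and compares the position of the largest entry before and after the transformation.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

X = np.array([13.0, 31.0, 5.0])
probs = softmax(X)

print(np.argmax(X))      # 1, the position of the largest input
print(np.argmax(probs))  # 1, softmax keeps the same position on top
print(probs.sum())       # the outputs sum to 1 (up to rounding)
```

Unlike np.argmax, softmax still returns one value per input, so the relative sizes of the other scores are not thrown away.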

Code:

import numpy as np

# your solution:
def your_softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

# correct solution:
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0) # only difference

Solution 19:

This also works with np.reshape.

def softmax(scores):
    """
    Compute softmax scores given the raw output from the model

    :param scores: raw scores from the model (N, num_classes)
    :return:
        prob: softmax probabilities (N, num_classes)
    """
    # subtract the largest number: https://jamesmccaffrey.wordpress.com/2016/03/04/the-max-trick-when-computing-softmax/
    exponential = np.exp(scores - np.max(scores, axis=1).reshape(-1, 1))
    prob = exponential / exponential.sum(axis=1).reshape(-1, 1)

    return prob

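Solution 20:

The purpose of the softmax function is to preserve the ratio of the vector components, as opposed to squashing the end-points with a sigmoid as the values saturate (i.e. tend to +/- 1 (tanh) or from 0 to 1 (logistic)). This is because it preserves more information about the rate of change at the end-points and is thus more applicable to neural nets with 1-of-N output encoding (i.e. if we squashed the end-points, it would be harder to differentiate the 1-of-N output classes, because we could not tell which one is the "biggest" or "smallest" once they had been squashed). It also makes the total output sum to 1: the clear winner will be closer to 1, while other numbers that are close to each other will sum to 1/p, where p is the number of output neurons with similar values.

The purpose of subtracting the max value from the vector is that when you compute the e^y exponents you may get a very high value that clips the float at its maximum, leading to a tie, which is not the case in this example. This becomes a big problem if you subtract the max value to make a negative number: you then have a negative exponent that rapidly shrinks the values and alters the ratio, which is what occurred in the poster's question and yielded the incorrect answer.

The answer supplied by Udacity is horribly inefficient. The first thing we need to do is calculate e^y_j for all vector components, keep those values, then sum them up, and divide. Where Udacity messed up is that they calculate e^y_j twice! Here is the correct answer: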

def softmax(y):
    e_to_the_y_j = np.exp(y)
    return e_to_the_y_j / np.sum(e_to_the_y_j, axis=0)

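Solution 21:

I would like to add a little more understanding of the problem. Subtracting the max of the array is correct here. But if you run the code from the other post, you will find that it does not give the right answer when the array is 2-D or of higher dimension.

Here are some suggestions:

To get the max, take it along the x-axis; you will get a 1-D array.
Reshape your max array to the original shape.
Apply np.exp to get the exponential values.
Apply np.sum along the axis.
Get the final results.

Follow these steps and you will get the correct answer through vectorization. Since this is related to college homework, I cannot post the exact code here, but I am happy to give more suggestions if you don't understand.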
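One possible reading of the steps listed above, for a 2-D array with one sample per row, is sketched below. The helper name softmax_rows is hypothetical, and taking "along the x-axis" to mean axis=1 is an assumption; the answer itself does not show code.

```python
import numpy as np

def softmax_rows(x):
    """Hypothetical helper following the listed steps for a 2-D array."""
    x = np.asarray(x, dtype=float)
    row_max = np.max(x, axis=1)                   # step 1: 1-D array of maxima
    row_max = row_max.reshape(-1, 1)              # step 2: reshape so it broadcasts
    e_x = np.exp(x - row_max)                     # step 3: exponentiate
    row_sum = np.sum(e_x, axis=1).reshape(-1, 1)  # step 4: sum along the axis
    return e_x / row_sum                          # step 5: final result

x = np.array([[1, 2, 3, 6],
              [2, 4, 5, 6],
              [1, 2, 3, 6]])
print(softmax_rows(x))
```

With this batch, the first and third rows of the result are identical, as expected for identical samples.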

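Solution 22:

The goal was to achieve similar results using Numpy and Tensorflow. The only change from the original answer is the axis parameter of the np.sum API.

Initial approach: axis=0 - this, however, does not give the intended results when the dimensions are N.

Modified approach: axis=len(e_x.shape)-1 - always sum over the last dimension. This gives results similar to tensorflow's softmax function.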

def softmax_fn(input_array):
    """
    | **@author**: Prathyush SP
    |
    | Calculate Softmax for a given array
    :param input_array: Input Array
    :return: Softmax Score
    """
    e_x = np.exp(input_array - np.max(input_array))
    # keepdims=True keeps the summed axis so the division broadcasts correctly
    return e_x / e_x.sum(axis=len(e_x.shape)-1, keepdims=True)

Solution 23:

Here is a generalized solution using numpy, with a comparison for correctness against tensorflow and scipy:

Data preparation:

import numpy as np

np.random.seed(2019)

batch_size = 1
n_items = 3
n_classes = 2
logits_np = np.random.rand(batch_size,n_items,n_classes).astype(np.float32)
print('logits_np.shape', logits_np.shape)
print('logits_np:')
print(logits_np)

Output:

logits_np.shape (1, 3, 2)
logits_np:
[[[0.9034822  0.3930805 ]
  [0.62397    0.6378774 ]
  [0.88049906 0.299172  ]]]

Softmax using TensorFlow:

import tensorflow as tf

logits_tf = tf.convert_to_tensor(logits_np, np.float32)
scores_tf = tf.nn.softmax(logits_np, axis=-1)

print('logits_tf.shape', logits_tf.shape)
print('scores_tf.shape', scores_tf.shape)

with tf.Session() as sess:
    scores_np = sess.run(scores_tf)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np,axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

logits_tf.shape (1, 3, 2)
scores_tf.shape (1, 3, 2)
scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using scipy:

from scipy.special import softmax

scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.4965232  0.5034768 ]
  [0.6413727  0.35862732]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Softmax using numpy (https://nolanbconaway.github.io/blog/2017/softmax-numpy):

def softmax(X, theta = 1.0, axis = None):
    """
    Compute the softmax of each element along an axis of X.

    Parameters
    ----------
    X: ND-Array. Probably should be floats.
    theta (optional): float parameter, used as a multiplier
        prior to exponentiation. Default = 1.0
    axis (optional): axis to compute values along. Default is the
        first non-singleton axis.

    Returns an array the same size as X. The result will sum to 1
    along the specified axis.
    """

    # make X at least 2d
    y = np.atleast_2d(X)

    # find axis
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)

    # multiply y against the theta parameter,
    y = y * float(theta)

    # subtract the max for numerical stability
    y = y - np.expand_dims(np.max(y, axis = axis), axis)

    # exponentiate y
    y = np.exp(y)

    # take the sum along the specified axis
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)

    # finally: divide elementwise
    p = y / ax_sum

    # flatten if X was 1D
    if len(X.shape) == 1: p = p.flatten()

    return p


scores_np = softmax(logits_np, axis=-1)

print('scores_np.shape', scores_np.shape)
print('scores_np:')
print(scores_np)

print('np.sum(scores_np, axis=-1).shape', np.sum(scores_np, axis=-1).shape)
print('np.sum(scores_np, axis=-1):')
print(np.sum(scores_np, axis=-1))

Output:

scores_np.shape (1, 3, 2)
scores_np:
[[[0.62490064 0.37509936]
  [0.49652317 0.5034768 ]
  [0.64137274 0.3586273 ]]]
np.sum(scores_np, axis=-1).shape (1, 3)
np.sum(scores_np, axis=-1):
[[1. 1. 1.]]

Solution 24:

This generalizes and assumes you are normalizing the trailing dimension.

def softmax(x: np.ndarray) -> np.ndarray:
    e_x = np.exp(x - np.max(x, axis=-1)[..., None])
    e_y = e_x.sum(axis=-1)[..., None]
    return e_x / e_y