Find the row indexes of several values in a numpy array

Question:

I have an array X:

X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

I would like to find the row index of each of several values in this array:

searched_values = np.array([[4, 2],
                            [3, 3],
                            [5, 6]])

For this example, I would like a result like this:

[0,3,4]

I have code that does this, but I think it is overly complicated:

import numpy as np

X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

searched_values = np.array([[4, 2],
                            [3, 3],
                            [5, 6]])

result = []

for s in searched_values:
    idx = np.argwhere([np.all((X-s)==0, axis=1)])[0][1]
    result.append(idx)

print(result)

I found an answer to a similar question, but it works only for 1D arrays.

Is there a simpler way to do what I want?

Solution 1:

Approach #1

One approach would be to use NumPy broadcasting, like so:

np.where((X==searched_values[:,None]).all(-1))[1]
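To make the one-liner easier to follow, here is a minimal sketch that splits it into steps using the arrays from the question. The intermediate names (eq, match, s_idx, x_idx) are mine and purely illustrative.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# searched_values[:, None] has shape (3, 1, 2); comparing it with X (5, 2)
# broadcasts to a (3, 5, 2) boolean array.
eq = X == searched_values[:, None]

# Collapse the last axis: match[i, j] is True when searched_values[i] equals X[j].
match = eq.all(-1)

# np.where returns (indices into searched_values, indices into X).
s_idx, x_idx = np.where(match)
print(eq.shape, match.shape)  # (3, 5, 2) (3, 5)
print(x_idx)                  # [0 3 4]
```

The intermediate array grows as len(searched_values) * len(X) * row_length, which is why the approaches that follow are described as more memory efficient.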

Approach #2

A memory-efficient approach is to convert each row to its linear index equivalent and then use np.in1d, like so:

dims = X.max(0)+1
out = np.where(np.in1d(np.ravel_multi_index(X.T,dims),
                    np.ravel_multi_index(searched_values.T,dims)))[0]
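A self-contained version of this idea might look like the sketch below. Two things here are my own assumptions rather than part of the answer: np.isin is used in place of np.in1d (which, as far as I know, newer NumPy releases deprecate), and dims is computed from both arrays so that np.ravel_multi_index does not raise if searched_values holds a value larger than anything in X.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# Grid shape large enough for both arrays (assumes non-negative integers).
dims = np.maximum(X.max(0), searched_values.max(0)) + 1

# One scalar per row; equal rows map to equal scalars.
X1D = np.ravel_multi_index(X.T, dims)
S1D = np.ravel_multi_index(searched_values.T, dims)

# Indices of the rows of X that occur anywhere in searched_values, in X order.
out = np.where(np.isin(X1D, S1D))[0]
print(out)  # [0 3 4]
```

Note that the result follows the row order of X, not the order of searched_values.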

Approach #3

Another memory-efficient approach, using np.searchsorted and the same idea of converting rows to linear index equivalents, would be like so:

dims = X.max(0)+1
X1D = np.ravel_multi_index(X.T,dims)
searched_valuesID = np.ravel_multi_index(searched_values.T,dims)
sidx = X1D.argsort()
out = sidx[np.searchsorted(X1D,searched_valuesID,sorter=sidx)]

Please note that this np.searchsorted method assumes there is a match in X for each row of searched_values.
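If that assumption may not hold, one possible way to guard against it is to verify each candidate after the lookup. This is only a sketch under my own assumptions: the -1 sentinel, the extra [0, 0] row, and the names pos and candidate are not from the answer.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
# The last row is deliberately absent from X.
searched_values = np.array([[4, 2], [3, 3], [5, 6], [0, 0]])

dims = np.maximum(X.max(0), searched_values.max(0)) + 1
X1D = np.ravel_multi_index(X.T, dims)
S1D = np.ravel_multi_index(searched_values.T, dims)

sidx = X1D.argsort()
pos = np.searchsorted(X1D, S1D, sorter=sidx)
# searchsorted can return len(X1D) for values past the end; clip before indexing.
pos = np.clip(pos, 0, len(X1D) - 1)
candidate = sidx[pos]

# Keep a candidate only if the row really matches, otherwise report -1.
out = np.where(X1D[candidate] == S1D, candidate, -1)
print(out)  # [ 0  3  4 -1]
```

Unlike the np.in1d variant, this keeps the output aligned with the order of searched_values.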

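How does `np.ravel_multi_index` work?

This function gives us the linear index equivalents. It accepts a 2D array of n-dimensional indices, set as columns, together with the shape of the n-dimensional grid onto which those indices are to be mapped, and computes the equivalent linear indices.

Let's use the inputs we have for the problem at hand. Take the input X and note its first row. Since we are trying to convert each row of X into its linear index equivalent, and since np.ravel_multi_index treats each column as one indexing tuple, we need to transpose X before feeding it into the function. Since the number of elements per row in X is 2 in this case, the n-dimensional grid to be mapped onto is 2D. With 3 elements per row in X, it would have been a 3D grid, and so on.

To see how this function computes linear indices, consider the first row of X: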

In [77]: X
Out[77]: 
array([[4, 2],
       [9, 3],
       [8, 5],
       [3, 3],
       [5, 6]])

We have the shape of the n-dimensional grid as dims:

In [78]: dims
Out[78]: array([10,  7])

Let's create the 2-dimensional grid to see how that mapping works and how the linear indices get computed with np.ravel_multi_index:

In [79]: out = np.zeros(dims,dtype=int)

In [80]: out
Out[80]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Let's set the first indexing tuple from X, i.e. the first row of X, into the grid:

In [81]: out[4,2] = 1

In [82]: out
Out[82]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])

Now, to see the linear index equivalent of the element just set, let's flatten the grid and use np.where to detect that 1.

In [83]: np.where(out.ravel())[0]
Out[83]: array([30])

This could also be computed by hand if row-major ordering is taken into account.

Let's use np.ravel_multi_index and verify those linear indices:

In [84]: np.ravel_multi_index(X.T,dims)
Out[84]: array([30, 66, 61, 24, 41])

Thus, we have a linear index corresponding to each indexing tuple from X, i.e. each row of X.
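The row-major computation mentioned earlier can be spelled out by hand. The sketch below reproduces the linear indices with plain arithmetic and maps them back with np.unravel_index; the names manual, rows and cols are illustrative.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
dims = X.max(0) + 1  # array([10, 7])

# Row-major (C-order) formula for a 2D grid: row * number_of_columns + column.
# For the first row: 4 * 7 + 2 = 30.
manual = X[:, 0] * dims[1] + X[:, 1]
print(manual)  # [30 66 61 24 41]

# np.unravel_index goes the other way and recovers the original rows.
rows, cols = np.unravel_index(manual, tuple(dims.tolist()))
print(np.column_stack((rows, cols)))
```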

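Choosing dimensions for np.ravel_multi_index to form unique linear indices

Now, the idea behind treating each row of X as an indexing tuple of an n-dimensional grid, and converting each such tuple to a scalar, is to have unique scalars corresponding to unique tuples, i.e. unique rows in X.

Let's take another look at X: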

In [77]: X
Out[77]: 
array([[4, 2],
       [9, 3],
       [8, 5],
       [3, 3],
       [5, 6]])

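Now, as discussed in the previous section, we treat each row as an indexing tuple. Within each such tuple, the first element represents the first axis of the n-dimensional grid, the second element represents the second axis, and so on until the last element of each row in X. In essence, each column represents one dimension or axis of the grid. If we are to map all elements of X onto the same n-dimensional grid, we need to consider the maximum extent of each axis of that grid. Assuming we are dealing with non-negative numbers in X, that extent is the maximum of each column in X, plus 1. The + 1 is there because Python uses 0-based indexing. So, for example, X[1,0] == 9 maps to the 10th row of the proposed grid. Similarly, X[4,1] == 6 goes to the 7th column of that grid.

So, for our sample case, we have: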

In [7]: dims = X.max(axis=0) + 1 # Or simply X.max(0) + 1

In [8]: dims
Out[8]: array([10,  7])

Thus, we need a grid of shape at least (10,7) for our sample case. Larger lengths along the dimensions won't hurt and still give us unique linear indices.

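Concluding remarks: One important thing to note here is that if X contains negative numbers, we need to add proper offsets along each column of X to make those indexing tuples non-negative before using np.ravel_multi_index.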
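As an illustration of that offsetting step, here is a minimal sketch. The data with negative entries is hypothetical, np.isin stands in for np.in1d, and applying the same per-column offset to both arrays is my assumption about what "proper offsets" means in practice.

```python
import numpy as np

# Hypothetical data containing negative entries.
X = np.array([[-4, 2], [9, -3], [8, 5], [3, 3], [-5, 6]])
searched_values = np.array([[-4, 2], [3, 3], [-5, 6]])

# Shift both arrays by the same per-column minimum so every entry is >= 0.
offset = np.minimum(X.min(0), searched_values.min(0))
Xs = X - offset
Ss = searched_values - offset

dims = np.maximum(Xs.max(0), Ss.max(0)) + 1
X1D = np.ravel_multi_index(Xs.T, dims)
S1D = np.ravel_multi_index(Ss.T, dims)

out = np.where(np.isin(X1D, S1D))[0]
print(out)  # [0 3 4]
```

The same offset has to be applied to both arrays, otherwise equal rows would no longer map to equal linear indices.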

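Solution 2:

The numpy_indexed package (disclaimer: I am its author) contains functionality for performing such operations efficiently (it also uses searchsorted under the hood). In terms of functionality, it acts as a vectorized equivalent of list.index: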

import numpy_indexed as npi
result = npi.indices(X, searched_values)

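Note that with the 'missing' kwarg you have full control over the behavior for missing items, and it works for nd-arrays (e.g. stacks of images) as well.

Update: using the same shapes as @Rik, X=[520000,28,28] and searched_values=[20000,28,28], it runs in 0.8064 secs, using missing=-1 to detect and denote entries not present in X.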

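Solution 3:

Another alternative is to use asvoid (below) to view each row as a single value of void dtype. This reduces a 2D array to a 1D array, which allows you to use np.in1d as usual: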

import numpy as np

def asvoid(arr):
    """
    Based on http://stackoverflow.com/a/16973510/190597 (Jaime, 2013-06)
    View the array as dtype np.void (bytes). The items along the last axis are
    viewed as one value. This allows comparisons to be performed which treat
    entire rows as one value.
    """
    arr = np.ascontiguousarray(arr)
    if np.issubdtype(arr.dtype, np.floating):
        """ Care needs to be taken here since
        np.array([-0.]).view(np.void) != np.array([0.]).view(np.void)
        Adding 0. converts -0. to 0.
        """
        arr += 0.
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

searched_values = np.array([[4, 2],
                            [3, 3],
                            [5, 6]])

idx = np.flatnonzero(np.in1d(asvoid(X), asvoid(searched_values)))
print(idx)
# [0 3 4]

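Solution 4:

Here is a fairly fast solution using numpy and hashlib that scales well. It can handle large matrices or images in seconds. I ran it on a 520000 x (28 x 28) array and a 20000 x (28 x 28) array in 2 seconds on my CPU.

Code: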

import numpy as np
import hashlib


X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

searched_values = np.array([[4, 2],
                            [3, 3],
                            [5, 6]])

#hash using sha1 appears to be efficient
xhash=[hashlib.sha1(row).digest() for row in X]
yhash=[hashlib.sha1(row).digest() for row in searched_values]

z=np.in1d(xhash,yhash)  

##Use unique to get unique indices to in1d results
_,unique=np.unique(np.array(xhash)[z],return_index=True)

##Compute unique indices by indexing an array of indices
idx=np.array(range(len(xhash)))
unique_idx=idx[z][unique]

print('unique_idx=',unique_idx)
print('X[unique_idx]=',X[unique_idx])

Output:

unique_idx= [4 3 0]
X[unique_idx]= [[5 6]
 [3 3]
 [4 2]]
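A related variation, sketched here under my own assumptions and not taken from the answer, swaps the SHA-1 digests for a plain Python dict keyed on each row's raw bytes. It returns indices in the order of searched_values and marks missing rows with -1; the names lookup and result are illustrative.

```python
import numpy as np

X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# Map the raw bytes of each row of X to its first row index.
# Both arrays must share dtype and row length for the bytes to be comparable.
lookup = {}
for i, row in enumerate(X):
    lookup.setdefault(row.tobytes(), i)

# -1 marks searched rows that do not occur in X.
result = [lookup.get(row.tobytes(), -1) for row in searched_values]
print(result)  # [0, 3, 4]
```

Because the dict compares the full byte strings, it does not rely on hash digests being collision free, at the cost of holding one key per distinct row of X in memory.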

Solution 5:

X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

S = np.array([[4, 2],
              [3, 3],
              [5, 6]])

result = [[i for i,row in enumerate(X) if (s==row).all()] for s in S]

or

result = [i for s in S for i,row in enumerate(X) if (s==row).all()]

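if you want a flat list (assuming there is exactly one match per searched value).

Solution 6:

I had a similar requirement, and the following worked for me: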

np.argwhere(np.isin(X, searched_values).all(axis=1))
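One caveat worth checking before relying on this: np.isin tests each element of X against the flattened contents of searched_values, not against whole rows, so a row built from values that appear in different searched rows is also reported. The sketch below uses a hypothetical extra row [2, 4] to show the difference from a row-wise comparison.

```python
import numpy as np

# Hypothetical X with an extra row [2, 4] that is NOT in searched_values.
X = np.array([[4, 2], [9, 3], [8, 5], [3, 3], [5, 6], [2, 4]])
searched_values = np.array([[4, 2], [3, 3], [5, 6]])

# np.isin checks each element against the flattened set {2, 3, 4, 5, 6}.
elementwise = np.argwhere(np.isin(X, searched_values).all(axis=1)).ravel()
print(elementwise)  # [0 3 4 5]  <- row 5 is a false positive

# Row-wise comparison via broadcasting only reports true row matches.
rowwise = np.where((X == searched_values[:, None]).all(-1))[1]
print(rowwise)      # [0 3 4]
```

On the arrays from the question the two give the same answer, which is probably why the shortcut appeared to work.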

Solution 7:

Another alternative is to use the cdist function from scipy.spatial.distance, like this:

from scipy.spatial.distance import cdist

np.nonzero(cdist(X, searched_values) == 0)[0]

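Basically, we get the row numbers of X that have distance zero to a row in searched_values, meaning they are equal. This makes sense if you think of the rows as coordinates.

Solution 8:

Here is what I came up with: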

def find_points(orig: np.ndarray, search: np.ndarray) -> np.ndarray:
    equals = [np.equal(orig, p).all(1) for p in search]
    exists = np.max(equals, axis=1)
    indices = np.argmax(equals, axis=1)
    indices[exists == False] = -1
    return indices

Test:

X = np.array([[4,  2],
              [9,  3],
              [8,  5],
              [3,  3],
              [5,  6]])

searched_values = np.array([[4, 2],
                            [3, 3],
                            [5, 6],
                            [0, 0]])

find_points(X, searched_values)

Output: