Find unique rows in numpy.array

Question:

I need to find the unique rows in a numpy.array.

For example:

>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])

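I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe there is a way to set the data type to void and then I could just use numpy.unique, but I couldn't figure out how to make it work.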

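Solution 1:

As of NumPy 1.13, you can simply pass the axis argument to select the unique values along any axis of an N-dimensional array. To get the unique rows, use np.unique as follows: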

unique_rows = np.unique(original_array, axis=0)
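As a minimal sketch of the call above, applied to the array from the question: np.unique returns the rows in sorted order, so this example also shows one way to recover the order of first appearance using the return_index option. The variable names are illustrative only.

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Unique rows in sorted (lexicographic) order, plus the index of each
# row's first occurrence and how often each row appears.
unique_rows, first_idx, counts = np.unique(
    a, axis=0, return_index=True, return_counts=True
)

# To keep the rows in order of first appearance instead, index the
# original array with the sorted first-occurrence indices.
in_original_order = a[np.sort(first_idx)]

print(unique_rows)
print(in_original_order)
```

The second result matches the new_a layout asked for in the question, where the first row stays first.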

Solution 2:

Another possible solution:

np.vstack({tuple(row) for row in a})

Edit: As others have mentioned, this approach is deprecated as of NumPy 1.16. In modern versions, you can do:

np.vstack(tuple(set(map(tuple,a))))

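map(tuple, a) turns each row of the matrix a into a tuple, which makes the rows hashable. set(map(tuple, a)) then builds a set of all the unique rows. A set is a non-sequence iterable, so it can no longer be used directly to construct a NumPy array. The outer call to tuple fixes this by converting the set into a tuple, which is acceptable input for creating an array.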
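A set does not remember the order in which rows were seen. If that order matters, one possible variation (not part of the original answer) is to use dict keys, which are unique and keep insertion order in Python 3.7 and later. The name rows_in_order is illustrative.

```python
import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Each row becomes a hashable tuple; dict.fromkeys drops duplicates
# while keeping the first occurrence of every row in its original place.
rows_in_order = np.array(list(dict.fromkeys(map(tuple, a))))

print(rows_in_order)
```

Like the set version, this loops over the rows in Python, so it is a convenience rather than a fast path for large arrays.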

Solution 3:

An alternative to structured arrays is to use a view of a void type that joins the whole row into a single item:

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]

>>> unique_a
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

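Edit: Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

Edit: The above can be sped up slightly, perhaps at the cost of clarity, by doing: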

unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
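The void-view steps above can be wrapped into a small helper. This is a sketch under a few assumptions: the name unique_rows_void is hypothetical, the input is 2-D with at least one column, and rows are compared byte by byte (so for floats, 0.0 and -0.0 would count as different rows). It returns rows in order of first appearance and accepts non-contiguous input such as a column slice.

```python
import numpy as np

def unique_rows_void(a):
    """Unique rows of a 2-D array via a void view (hypothetical helper)."""
    a = np.ascontiguousarray(a)  # the view below needs contiguous rows
    # One opaque item per row: itemsize * number of columns bytes.
    row_dtype = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    b = a.view(row_dtype)        # shape (n_rows, 1)
    _, idx = np.unique(b, return_index=True)
    return a[np.sort(idx)]       # rows in order of first appearance

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

print(unique_rows_void(a))
print(unique_rows_void(a[:, ::2]))  # non-contiguous column slice
```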

Also, at least on my system, its performance is on par with, or even better than, the lexsort method:

a = np.random.randint(2, size=(10000, 6))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop

a = np.random.randint(2, size=(10000, 100))

%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop

%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop

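Solution 4:

If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.

The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.

As a quick example: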

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])

ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)

uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print(uniq)

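To understand what's going on, have a look at the intermediate results.

Once we view things as a structured array, each element in the array is a row of the original array. (Basically, it's a data structure similar to a list of tuples.)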

In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(0, 1, 1, 1, 0, 0)],
       [(1, 1, 1, 0, 0, 0)],
       [(1, 1, 1, 1, 1, 0)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

Once we run numpy.unique, we get a structured array back:

In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])

We then need to view it as a "normal" array (in ipython, _ holds the result of the last calculation, which is why you see _.view...):

In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])

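And then reshape it back into a 2D array (-1 is a placeholder that tells numpy to work out the correct number of rows given the number of columns):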

In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])

Obviously, if you want to be more concise, you can write it as:

import numpy as np

def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print(unique_rows(data))

Which results in:

[[0 1 1 1 0 0]
 [1 1 1 0 0 0]
 [1 1 1 1 1 0]]

Solution 5:

When I run np.unique on np.random.random(100).reshape(10,10), it returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:

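array = a  # your 2-D numpy array
new_array = [tuple(row) for row in array]  # tuples are hashable, so rows can go into a set
uniques = np.array(sorted(set(new_array)))  # np.unique(new_array) would flatten back to single elements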

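That is the only way I see of changing the types to do what you want, and I am not sure whether iterating over the rows to turn them into tuples is acceptable given your "no looping" requirement.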

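Solution 6:

np.unique works by sorting a flattened array and then checking whether each item is equal to the previous one. This can be done manually without flattening: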

ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]

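This method does not use tuples, and is faster and simpler than the other methods given here.

Note: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does create a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it: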

b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]

This is faster and uses less memory.
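For reference, the sorted-copy variant can be written as a small function. This is a sketch: the name unique_rows_lexsort and the empty-input guard are additions for illustration. Because np.lexsort treats its last key as the primary one, the rows come back ordered by the last column first, which differs from the order np.unique produces.

```python
import numpy as np

def unique_rows_lexsort(a):
    """Unique rows of a 2-D array via lexsort (hypothetical helper)."""
    a = np.asarray(a)
    if a.shape[0] == 0:
        return a.copy()                 # nothing to deduplicate
    b = a[np.lexsort(a.T)]              # sorted copy; equal rows become adjacent
    # Keep the first row, then every row that differs from its predecessor.
    keep = np.concatenate(([True], np.any(b[1:] != b[:-1], axis=1)))
    return b[keep]

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

print(unique_rows_lexsort(a))
```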

Also, if you want to find the unique rows in an ndarray regardless of how many dimensions the array has, the following will work:

b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:]!=b[:-1],axis=tuple(range(1,a.ndim)))))]

An interesting remaining issue is sorting/uniquing along an arbitrary axis of an arbitrary-dimension array, which is more difficult.
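One way to approach the arbitrary-axis case is to move the axis of interest to the front with np.moveaxis, apply the same lexsort idea to the flattened slices, and move the axis back. The sketch below is an assumption about how this could be done, not part of the original answer; unique_along_axis is a hypothetical name. In NumPy 1.13 and later, np.unique(a, axis=k) covers the same need directly.

```python
import numpy as np

def unique_along_axis(a, axis=0):
    """Unique slices along `axis` using lexsort (hypothetical helper)."""
    a = np.moveaxis(np.asarray(a), axis, 0)   # axis of interest goes first
    if a.shape[0] == 0:
        return np.moveaxis(a.copy(), 0, axis)
    flat = a.reshape(a.shape[0], -1)          # one flattened row per slice
    order = np.lexsort(flat.T)
    flat_sorted = flat[order]
    # A slice is kept if it differs from the previous one in sorted order.
    keep = np.concatenate(
        ([True], np.any(flat_sorted[1:] != flat_sorted[:-1], axis=1))
    )
    return np.moveaxis(a[order][keep], 0, axis)

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=(4, 3, 3))
x = np.concatenate([base, base], axis=1)      # every slice along axis 1 appears twice

result = unique_along_axis(x, axis=1)
print(result.shape)
```

The slices come back in lexsort order, so they match np.unique(x, axis=1) as a set of slices but not necessarily in the same sequence.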

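Edit:

To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't much difference, although this version is a bit faster: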

In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop

In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop

In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop

With a larger a, however, this version ends up being much faster:

In [96]: a = np.random.randint(0,2,size=(10000,6))

In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop

In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop

In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop

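Solution 7:

I compared the suggested alternatives for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're after speed, you'll want: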

numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize*a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])

I've implemented that fastest variant in npx.unique_rows.

There is also a bug report about this on GitHub.

[Figure: performance plot comparing the methods, with len(a) on the x-axis]

Code to reproduce the plot:

import numpy
import perfplot


def unique_void_view(a):
    return (
        numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
        .view(a.dtype)
        .reshape(-1, a.shape[1])
    )


def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[
        ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
    ]


def vstack(a):
    return numpy.vstack([tuple(row) for row in a])


def unique_axis(a):
    return numpy.unique(a, axis=0)


perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
    equality_check=None,
)
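Because the perfplot call passes equality_check=None, the kernels' outputs are never compared. A small dependency-free sketch, using only the standard-library timeit module, can both confirm that two of the methods return the same set of rows and give a rough timing. The helper as_row_set and the array size are assumptions chosen for illustration, and the timings will vary by machine and NumPy version.

```python
import timeit
import numpy as np

def void_view(a):
    # Treat each row as one opaque void item, then view the result back.
    row = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    u = np.unique(np.ascontiguousarray(a).view(row))
    return u.view(a.dtype).reshape(-1, a.shape[1])

def unique_axis(a):
    return np.unique(a, axis=0)

def as_row_set(rows):
    # Order-independent comparison: the methods may sort rows differently.
    return {tuple(r) for r in rows.tolist()}

a = np.random.default_rng(0).integers(0, 2, size=(1000, 20))
agree = as_row_set(void_view(a)) == as_row_set(unique_axis(a))

# Best of three repeats, ten calls each, in seconds.
timings = {
    f.__name__: min(timeit.repeat(lambda: f(a), number=10, repeat=3))
    for f in (void_view, unique_axis)
}
print(agree, timings)
```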