加速二维数组上的NumPy循环删除simi行

3条回答

网友

1楼 · 编辑于 2024-10-01 13:36:25

第一行：

t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01

每次取一行与整个数组之差的绝对值之和。这可能不是您所需要的，请尝试获取当前行与数组中稍后的行之间的差异。您已经将前面的所有行与当前行进行了比较，为什么还要再次进行比较？

另外，从数组中删除行是一项代价高昂、速度慢的操作，因此您可能会发现检查所有行然后删除几乎重复的行会更快。您也不能检查任何已计划删除的行，因为您知道它们将被删除。你知道吗

网友

2楼 · 编辑于 2024-10-01 13:36:25

让我们创建一些假数据：

np.random.seed(0)
a = np.random.random((4,3))

现在我们有：

array([[ 0.5488135 ,  0.71518937,  0.60276338],
       [ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

接下来，我们需要所有行对的元素差异之和。我们可以使用Manhattan Distance：

d = sklearn.metrics.pairwise.manhattan_distances(a)

它给出：

array([[ 0.        ,  0.33859562,  0.64870931,  0.31577611],
       [ 0.33859562,  0.        ,  0.89318282,  0.6465111 ],
       [ 0.64870931,  0.89318282,  0.        ,  0.5889615 ],
       [ 0.31577611,  0.6465111 ,  0.5889615 ,  0.        ]])

现在可以应用阈值，只保留一个三角形：

m = np.tril(d < 0.4, -1) # large threshold just for this example

得到一个布尔掩码：

array([[False, False, False, False],
       [ True, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

这说明第0行与第1行和第3行“太相似”。现在，您可以从原始矩阵中移除掩码的任何元素为真的行：

a[~np.any(m, axis=0)] # axis can be either 0 or 1 - design choice

这给了你：

array([[ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492]])

综合起来：

d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]

网友

3楼 · 编辑于 2024-10-01 13:36:25

方法#1:广播

以下是一种矢量化方法，它利用^{}将a扩展到3D，然后以矢量化的方式在所有迭代中执行这些计算-

mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
a_out = a[~np.triu(mask,1).any(0)]

方法2：使用pdist（'cityblock'）

对于大型阵列，我们将使用前面的方法遇到内存问题。因此，作为另一种方法，我们可以利用pdist的'cityblock'来计算曼哈顿距离，然后以平方距离矩阵的形式识别相应的row/col，而不需要实际计算该形式，而是使用searchsorted来获得有效的解。你知道吗

以下是实施方案-

from scipy.spatial.distance import pdist

n = a.shape[0]
dists = pdist(a, 'cityblock')
idx = np.flatnonzero(dists < thresh)
sep_idx = np.arange(n-1,0,-1).cumsum()
rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
a_out = np.delete(a,rm_idx,axis=0)

标杆管理

接近-

# Approach#2 from this post
def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n-1,0,-1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
    return np.delete(a,rm_idx,axis=0)

# @John Zwinck's soln
def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    return a[~np.any(np.tril(d < thresh), axis=0)]

计时-

In [209]: a = np.random.randint(0,9,(4000,30))

# Let's set 100 rows randomly as dups    
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)

In [211]: idx1 = np.random.choice(4000,size=100, replace=0)

In [217]: a[idx0] = a[idx1]

In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop

In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop

标杆管理

相关问题更多 >

编程相关推荐

热门问题

热门文章