<p>首先,我们用<code>metric = 'chebyshev'</code>做<code>pdist</code></p>
<pre><code>test = np.array([(1, 27),
(1, 27),
(1, 21),
(2, 23),
(3, 25),
(4, 23),
(4, 28),
(4, 27),
(4, 22),
(4, 24),
(5, 26),
(6, 21),
(7, 26),
(7, 20),
(8, 24),
(8, 25),
(8, 23),
(9, 20),
(9, 28),
(9, 21)])
from scipy.spatial.distance import pdist, squareform
c_mat = squareform(pdist(test, metric = 'chebyshev')) <= 1
</code></pre>
<p>现在<code>c_mat</code>基本上是一个节点图,如果它们是<;每个方向都有一个间隔</p>
<p>要找到完整的未连接图,在<code>networx</code>中可能有一个快速操作,但我将在<code>numpy</code>中以稍微困难的方式进行操作,因为我不知道在那里要查找什么图论关键字</p>
<pre><code>out = np.ones((c_mat.shape[0], 2))
while out.sum(0).max() >1:
c_mat = c_mat @ c_mat
out = np.unique(c_mat, axis = 0)
</code></pre>
<p>现在<code>c_mat</code>是<code>True</code>,如果有任何连接两行的链,<code>out</code>是所有单独组的布尔索引。现在返回结果:</p>
<pre><code>for mask in list(out):
print(np.unique(test[mask], axis = 0))
[[ 9 28]]
[[ 9 20]
[ 9 21]]
[[ 7 26]
[ 8 23]
[ 8 24]
[ 8 25]]
[[ 6 21]
[ 7 20]]
[[ 4 27]
[ 4 28]
[ 5 26]]
[[ 3 25]
[ 4 22]
[ 4 23]
[ 4 24]]
[[ 2 23]]
[[ 1 21]]
[[ 1 27]]
</code></pre>
<p>您还可以使用这些布尔索引来访问原始<code>DataFrame</code>中的数据行</p>
<p><strong>编辑1:</strong></p>
<p>现在,我们可以利用输入是半排序的这一事实来加快速度。但要做到这一点,我们需要<code>numba</code></p>
<pre><code>from numba import jit
@jit
def find_connected(data, dist = 1):
i = list(range(data.shape[0]))
j = list(range(data.shape[0]))
l = data.shape[0]
for x in range(1, l):
for y in range(x, l):
v = np.abs(data[x] - data[y])
if v.max() <= dist:
i += [x, y]
j += [y, x]
if v.min() > dist:
break
d = [1] * len(i)
return (d, (i, j))
</code></pre>
<p>现在我们需要将其加载到稀疏矩阵中</p>
<pre><code>from scipy.sparse import csr_matrix
c_mat = csr_matrix(find_connected(test), dtype = bool)
</code></pre>
<p><code>csr</code>做点积要快得多,所以<code>c_mat = c_mat @ c_mat</code>可以工作,但停止条件中断。您可以使用Anreas K.的优秀答案<a href="https://stackoverflow.com/questions/46126840/get-unique-rows-from-a-scipy-sparse-matrix">here</a>,也可以直接使用<code>out = np.unique(c_mat.todense(), axis = 0)</code></p>
<p><strong>编辑2:</strong></p>
<p>在我解决这个问题之前,我无法从脑海中摆脱出来,除非我没有制作一个稠密的矩阵</p>
<pre><code>import numba as nb
import numpy as np
@nb.njit
def find_connected_semisort(data, dist = 1):
l = data.shape[0]
out = []
for x in range(l):
for y in range(x, l):
v = np.abs(data[x] - data[y])
if v.max() <= dist:
out.append(set([x, y]))
if v.min() > dist:
break
outlen = len(out)
for x in range(outlen):
for y in range(x + 1, outlen):
if len(out[x] & out[y]) > 0:
out[y] |= out[x]
out[x].clear()
return [list(i) for i in out if len(i) > 0]
[np.unique(test[i], axis = 0).squeeze() for i in find_connected_semisort(test)]
Out[]:
[array([ 1, 27]), array([ 1, 21]), array([ 2, 23]), array([[ 3, 25],
[ 4, 22],
[ 4, 23],
[ 4, 24]]), array([[ 4, 27],
[ 4, 28],
[ 5, 26]]), array([[ 6, 21],
[ 7, 20]]), array([[ 7, 26],
[ 8, 23],
[ 8, 24],
[ 8, 25]]), array([ 9, 28]), array([[ 9, 20],
[ 9, 21]])]
</code></pre>
<p>也许有办法不用两个循环就能完成,但我无法摸索</p>