Answers to: Clustering/grouping consecutive entries in a DataFrame by multiple columns



<h3>Question</h3> Suppose I have k scalar columns and I want to group entries that lie within a certain distance of each other along every column. For simplicity, suppose k is 2 and these are my only columns:
<pre><code>import pandas as pd
from random import choices

pd.DataFrame(list(zip(sorted(choices(range(0,10), k=20)), choices(range(20,29), k=20))), columns=['a','b'])
</code></pre>
yielding
<pre><code>[(1, 27), (1, 27), (1, 21), (2, 23), (3, 25), (4, 23), (4, 28), (4, 27), (4, 22), (4, 24), (5, 26), (6, 21), (7, 26), (7, 20), (8, 24), (8, 25), (8, 23), (9, 20), (9, 28), (9, 21)]
</code></pre>
I need a grouping in which each group contains entries that are at most <code>m</code> apart in column <code>a</code> and at most <code>n</code> apart in column <code>b</code>. If <code>m</code>=<code>n</code>=1, the clusters would be:
<pre><code>(1, 27), (1, 27)
(1, 21)
(2, 23)
(3, 25), (4, 23), (4, 22), (4, 24)
(4, 28), (4, 27), (5, 26)
(6, 21), (7, 20)
(7, 26), (8, 24), (8, 25), (8, 23)
(9, 20), (9, 21)
(9, 28)
</code></pre>
<h3>Notes</h3> One way to do this is with <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html" rel="nofollow noreferrer">pdist</a>, but that is not a good solution because:
<ul>
<li>I have a lot of data and do not want a quadratic operation</li>
<li>the data is already sorted, and m and n are small relative to the range of the columns</li>
<li>m ≠ n (not always); otherwise a Manhattan distance threshold of m+n would work</li>
</ul>
I believe this is a closely related question, but it does not have a general answer:
<ul>
<li><a href="https://stackoverflow.com/questions/47675262/group-by-continuous-indexes-in-pandas-dataframe">Group by continuous indexes in Pandas DataFrame</a></li>
</ul>
A sketch of an approach that may help you find an answer:
<pre><code>from itertools import tee

a, b, c, d, e = tee(range(10), 5)
next(b, None)
next(c, None);next(c, None)
next(d, None);next(d, None);next(d, None)
next(e, None);next(e, None);next(e, None);next(e, None)
list(zip(a, b, c, d, e))
[(0, 1, 2, 3, 4), (1, 2, 3, 4, 5), (2, 3, 4, 5, 6), (3, 4, 5, 6, 7), (4, 5, 6, 7, 8), (5, 6, 7, 8, 9)]
</code></pre>
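To make the grouping criterion concrete, here is a minimal sketch of the pairwise "neighbour" test with a separate threshold per column. The helper name `neighbour_mask` and its `thresholds` parameter are hypothetical, introduced only for illustration. It is quadratic in the number of rows, so it only pins down the definition on a handful of points and is not meant as a solution for large data.

```python
import numpy as np

def neighbour_mask(points, thresholds):
    """Return a boolean matrix that is True where two rows differ by at most
    thresholds[k] in every column k (hypothetical helper, quadratic cost)."""
    pts = np.asarray(points, dtype=float)
    thr = np.asarray(thresholds, dtype=float)
    # Absolute difference of every pair of rows, column by column.
    diff = np.abs(pts[:, None, :] - pts[None, :, :])
    return (diff <= thr).all(axis=2)

# Four points taken from the example data in the question.
pts = [(3, 25), (4, 24), (4, 27), (5, 26)]
mask = neighbour_mask(pts, thresholds=(1, 1))       # m = n = 1
mask_wide = neighbour_mask(pts, thresholds=(1, 2))  # m = 1, n = 2
```

Two rows end up in the same cluster when a chain of such neighbour pairs connects them, so the clusters are the connected components of the graph this mask describes.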


1 Answer

Anonymous, 1 day ago

First, we run <code>pdist</code> with <code>metric = 'chebyshev'</code>:
<pre><code>import numpy as np
from scipy.spatial.distance import pdist, squareform

test = np.array([(1, 27), (1, 27), (1, 21), (2, 23), (3, 25), (4, 23), (4, 28), (4, 27), (4, 22), (4, 24), (5, 26), (6, 21), (7, 26), (7, 20), (8, 24), (8, 25), (8, 23), (9, 20), (9, 28), (9, 21)])

c_mat = squareform(pdist(test, metric = 'chebyshev')) <= 1
</code></pre>
Now <code>c_mat</code> is essentially the adjacency matrix of a graph whose nodes are the rows, with two nodes connected if they are at most 1 apart along every column. To find the connected components there is probably a quick operation in <code>networkx</code>, but I will do it the slightly harder way in <code>numpy</code>, because I do not know which graph-theory keywords to look for there:
<pre><code>out = np.ones((c_mat.shape[0], 2))
# square the boolean matrix until every row belongs to exactly one unique pattern
while out.sum(0).max() > 1:
    c_mat = c_mat @ c_mat
    out = np.unique(c_mat, axis = 0)
</code></pre>
Now <code>c_mat</code> is <code>True</code> wherever some chain of connections links two rows, and <code>out</code> holds one boolean index per separate group. Now to return the result:
<pre><code>for mask in list(out):
    print(np.unique(test[mask], axis = 0))

[[ 9 28]]
[[ 9 20]
 [ 9 21]]
[[ 7 26]
 [ 8 23]
 [ 8 24]
 [ 8 25]]
[[ 6 21]
 [ 7 20]]
[[ 4 27]
 [ 4 28]
 [ 5 26]]
[[ 3 25]
 [ 4 22]
 [ 4 23]
 [ 4 24]]
[[ 2 23]]
[[ 1 21]]
[[ 1 27]]
</code></pre>
You can also use these boolean indexes to access the rows of the original <code>DataFrame</code>.

Edit 1: We can speed this up by exploiting the fact that the input is semi-sorted. To do that, though, we need <code>numba</code>:
<pre><code>from numba import jit

@jit
def find_connected(data, dist = 1):
    # start with the diagonal: every row is connected to itself
    i = list(range(data.shape[0]))
    j = list(range(data.shape[0]))
    l = data.shape[0]
    for x in range(l):  # start at 0 so that pairs involving the first row are not skipped
        for y in range(x + 1, l):
            v = np.abs(data[x] - data[y])
            if v.max() <= dist:
                i += [x, y]
                j += [y, x]
            if v.min() > dist:
                # every column is already out of range; stop scanning this row
                break
    d = [1] * len(i)
    return (d, (i, j))
</code></pre>
Now we need to load the result into a sparse matrix:
<pre><code>from scipy.sparse import csr_matrix

c_mat = csr_matrix(find_connected(test), dtype = bool)
</code></pre>
<code>csr</code> does dot products much faster, so <code>c_mat = c_mat @ c_mat</code> still works, but the stopping condition of the loop above no longer does. You can use Andreas K.'s excellent answer <a href="https://stackoverflow.com/questions/46126840/get-unique-rows-from-a-scipy-sparse-matrix">here</a>, or simply use <code>out = np.unique(c_mat.todense(), axis = 0)</code>.

Edit 2: I could not get this out of my head until I had solved it without building a dense matrix:
<pre><code>import numba as nb
import numpy as np

@nb.njit
def find_connected_semisort(data, dist = 1):
    l = data.shape[0]
    out = []
    # collect every connected pair (including each row with itself) as a set
    for x in range(l):
        for y in range(x, l):
            v = np.abs(data[x] - data[y])
            if v.max() <= dist:
                out.append(set([x, y]))
            if v.min() > dist:
                break
    outlen = len(out)
    # merge overlapping sets forward, emptying the ones that were absorbed
    for x in range(outlen):
        for y in range(x + 1, outlen):
            if len(out[x] & out[y]) > 0:
                out[y] |= out[x]
                out[x].clear()
    return [list(i) for i in out if len(i) > 0]

[np.unique(test[i], axis = 0).squeeze() for i in find_connected_semisort(test)]

Out[]:
[array([ 1, 27]),
 array([ 1, 21]),
 array([ 2, 23]),
 array([[ 3, 25],
        [ 4, 22],
        [ 4, 23],
        [ 4, 24]]),
 array([[ 4, 27],
        [ 4, 28],
        [ 5, 26]]),
 array([[ 6, 21],
        [ 7, 20]]),
 array([[ 7, 26],
        [ 8, 23],
        [ 8, 24],
        [ 8, 25]]),
 array([ 9, 28]),
 array([[ 9, 20],
        [ 9, 21]])]
</code></pre>
There may be a way to do this without the two loops, but I could not work it out.
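As a possible alternative to repeated matrix squaring or manual set merging, the group labels can be read off with SciPy's `scipy.sparse.csgraph.connected_components`. The sketch below is not the answerer's method. It assumes the rows are sorted on the first column, uses a hypothetical helper name `cluster_sorted`, and takes one threshold per column so that m and n may differ. The edge scan is a plain Python loop here, so it is not accelerated.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cluster_sorted(data, thresholds):
    """Label each row of `data` (assumed sorted on column 0) with a group id.

    Hypothetical helper: two rows are linked when they differ by at most
    thresholds[k] in every column k; groups are the connected components.
    """
    data = np.asarray(data)
    thr = np.asarray(thresholds)
    n = data.shape[0]
    rows, cols = [], []
    for x in range(n):
        for y in range(x + 1, n):
            diff = np.abs(data[y] - data[x])
            if diff[0] > thr[0]:
                break  # column 0 is sorted, so no later row can be in range
            if (diff <= thr).all():
                rows.append(x)
                cols.append(y)
    # Sparse adjacency matrix holding only the linked pairs.
    graph = coo_matrix((np.ones(len(rows), dtype=np.int8), (rows, cols)), shape=(n, n))
    n_groups, labels = connected_components(graph, directed=False)
    return n_groups, labels

test = np.array([(1, 27), (1, 27), (1, 21), (2, 23), (3, 25), (4, 23), (4, 28),
                 (4, 27), (4, 22), (4, 24), (5, 26), (6, 21), (7, 26), (7, 20),
                 (8, 24), (8, 25), (8, 23), (9, 20), (9, 28), (9, 21)])
n_groups, labels = cluster_sorted(test, thresholds=(1, 1))
```

Because `labels` has one entry per input row, it can be assigned as a new column of the original `DataFrame` and passed to `groupby`.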
