<p>Here is another answer, this one using <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html" rel="nofollow noreferrer"><code>scipy.spatial.KDTree</code></a>. It works particularly well when the bounding boxes don't overlap too much and when their &quot;radii&quot; are not too wildly distributed (<code>radius.max() / np.median(radius)</code> not too large, say between 1 and 2).</p>
<p>It works for <em>p</em>-norm 1 (Manhattan) or 2 (Euclidean), although in practice <code>p=2</code> is faster because, on average, each point sees fewer candidates (the total area of the circles is smaller than the total area of the diamonds).</p>
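<p>That claim is easy to sanity-check: for a <em>w</em>×<em>h</em> box, the p=2 query ball is a disk whose radius is half the diagonal, while the p=1 ball is a diamond of &quot;radius&quot; <code>(w + h) / 2</code>. For boxes that are not extremely elongated, the disk is the smaller of the two (the box size here is made up for illustration):</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

# a hypothetical 3-by-2 box; each radius is half the p-norm of the diagonal
w, h = 3.0, 2.0
r2 = np.hypot(w, h) / 2          # Euclidean (p=2) radius
r1 = (w + h) / 2                 # Manhattan (p=1) radius
area_disk = np.pi * r2**2        # area scanned around each center with p=2
area_diamond = 2 * r1**2         # area of the p=1 ball (a diamond)
print(area_disk < area_diamond)  # True: p=2 sees fewer candidates here
</code></pre>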
<p>Why is it so fast? <a href="https://en.wikipedia.org/wiki/K-d_tree" rel="nofollow noreferrer">K-d trees</a> are well suited to this kind of problem: they partition space by splitting it at each node along one dimension, at the midpoint. Once built, they are fast to query, as they provide a divide-and-conquer approach.</p>
<p>The key function is the following:</p>
<pre class="lang-py prettyprint-override"><code>def find(kd_boxes, kd_points, bb, r, p):
    # bb: m bounding boxes, in the form [x0, x1, y0, y1]
    # find one bb for each point (the first one), or m if none
    xy = kd_points.data
    found = np.full(len(xy), len(bb), dtype=int)
    cand = pad_jagged(kd_points.query_ball_tree(kd_boxes, r * 1.001, p=p))
    remaining = None
    for c in range(cand.shape[1]):
        # points that still have a candidate box in this column
        remaining = (np.nonzero(cand[:, c] >= 0)[0] if remaining is None
                     else remaining[np.nonzero(cand[remaining, c] >= 0)[0]])
        if remaining.size == 0:
            break
        i = remaining
        j = cand[i, c]
        # actual containment check: is point i inside candidate box j?
        ok = (bb[j, 0::2] <= xy[i]).all(1) & (xy[i] <= bb[j, 1::2]).all(1)
        k = np.nonzero(ok)[0]
        found[i[k]] = j[k]
        remaining = remaining[~ok]
    return found
</code></pre>
<p>Before calling this function, we compute a &quot;radius&quot; for each bounding box: half the p-norm of its diagonal. We then use the overall maximum radius <code>r</code> as the maximum distance for the <code>KDTree</code> query. That query (<code>kd_points.query_ball_tree()</code>) efficiently sifts through all the bounding-box centers and finds, in a single call, all those within radius of each point. This is a superset of the actual matches, but it is fast and drastically reduces the search space. The points that are <em>actually</em> inside a bounding box are then filtered, keeping track of the (first) matching box for each point.</p>
<p>As an optimization (not implemented here), one could take the sizes of the bounding boxes (the <code>radius</code> array) into account. If they are judged too uneven, the boxes can be split into two groups (e.g. around <code>np.median(radius)</code>), and the same search done for each half (recursively again, if necessary).</p>
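<p>A minimal sketch of that idea, splitting once around the median (this is not part of the solution above; the brute-force <code>search_group</code> helper stands in for the KD-tree search just to keep the example self-contained):</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

def search_group(xy, bb):
    # stand-in for the KD-tree search: brute-force containment check,
    # returning the first matching box index per point, or -1
    inside = ((bb[None, :, 0::2] <= xy[:, None, :]).all(-1)
              & (xy[:, None, :] <= bb[None, :, 1::2]).all(-1))  # (n, m)
    return np.where(inside.any(1), inside.argmax(1), -1)

def search_split(xy, bb, radius, max_ratio=2.0):
    # if the radii are too uneven, split the boxes around the median
    # radius, search each half separately, and combine (matches in the
    # small-radius half win here, a simplification of "first box wins")
    if radius.max() / np.median(radius) <= max_ratio:
        return search_group(xy, bb)
    small = radius <= np.median(radius)
    idx_s, idx_l = np.flatnonzero(small), np.flatnonzero(~small)
    f_s = search_group(xy, bb[idx_s])
    f_l = search_group(xy, bb[idx_l])
    out = np.where(f_s >= 0, idx_s[np.maximum(f_s, 0)], -1)
    fill = (out < 0) & (f_l >= 0)
    out[fill] = idx_l[f_l[fill]]
    return out

xy = np.array([[0.5, 0.5], [2.5, 0.5], [5.0, 5.0]])
bb = np.array([[0.0, 1.0, 0.0, 1.0],     # two small boxes...
               [2.0, 3.0, 0.0, 1.0],
               [0.0, 10.0, 0.0, 10.0]])  # ...and one much larger box
radius = np.array([0.71, 0.71, 7.07])    # max/median is large: split
print(search_split(xy, bb, radius))      # [0 1 2]
</code></pre>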
<p>For the OP's example, after some preparation to get the centers, bounding boxes, and radii in a more convenient form (all NumPy arrays), the query returns:</p>
<pre class="lang-py prettyprint-override"><code># preparation
import numpy as np
from scipy.spatial import KDTree

xy = df1[['longitude', 'latitude']].values
bb = df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values
centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
kd_boxes = KDTree(centers)
kd_points = KDTree(xy)

# query
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]
</code></pre>
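<p>The two matrix products above are just compact NumPy for &quot;average the x and y bounds&quot; and &quot;half the length of the diagonal&quot;; on a single made-up box:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

bb = np.array([[2.0, 6.0, 1.0, 4.0]])  # one box [x0, x1, y0, y1]: 4 wide, 3 tall
centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
print(centers)  # [[4.  2.5]] -- ((x0 + x1) / 2, (y0 + y1) / 2)
print(radius)   # [2.5] -- half of the 3-4-5 diagonal
</code></pre>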
<p>With the Manhattan norm we get similar results (in this case, identical):</p>
<pre class="lang-py prettyprint-override"><code>radius = bb @ np.array([-1,1,-1,1]) / 2
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001, p=1)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]
</code></pre>
<p>Both solutions are plotted below, highlighting all the candidate bounding boxes and the actual area within distance <code>r</code> of their centers:</p>
<img src="https://i.stack.imgur.com/viQEi.png" width="400"/>
<p>Note how, in both cases, point <code>#2</code> is wrongly included as a candidate for bbox <code>#2</code>. The first step, using the KD-trees, is only a filtering step. The next step is to check whether each point's candidate <em>actually</em> contains it. With that filter (the expression used for the <code>ok</code> array), the solution is:</p>
<img src="https://i.stack.imgur.com/sgljb.png" width="400"/>
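<p>The containment test used for <code>ok</code> is plain broadcasting: with boxes stored as <code>[x0, x1, y0, y1]</code>, columns <code>0::2</code> are the lower corners <code>(x0, y0)</code> and columns <code>1::2</code> the upper corners <code>(x1, y1)</code>. On two made-up point/candidate pairs:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

bb = np.array([[0.0, 2.0, 0.0, 2.0],   # boxes as rows [x0, x1, y0, y1]
               [3.0, 5.0, 1.0, 4.0]])
xy = np.array([[1.0, 1.0],             # inside box 0
               [4.0, 0.5]])            # in box 1's x-range, below its y-range
j = np.array([0, 1])                   # candidate box for each point
ok = (bb[j, 0::2] <= xy).all(1) & (xy <= bb[j, 1::2]).all(1)
print(ok)  # [ True False]
</code></pre>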
<p>Let's see what happens with more points and bounding boxes (possibly with overlaps):</p>
<pre class="lang-py prettyprint-override"><code>def gen_example(n, m, seed=None):
    # let's make a more complex problem, with overlaps
    if seed is not None:
        np.random.seed(seed)
    df1 = pd.DataFrame({
        'latitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
        'longitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
    })
    df2 = pd.DataFrame({
        'minlat': np.random.uniform(size=m),
        'maxlat': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'minlong': np.random.uniform(size=m),
        'maxlong': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'STRING': [f'b_{i}' for i in range(m)],
    })
    df2['maxlat'] += df2['minlat']
    df2['maxlong'] += df2['minlong']
    return df1, df2
</code></pre>
<p>For <code>df1, df2 = gen_example(32, 12, 0)</code>, the corresponding pictures are:</p>
<img src="https://i.stack.imgur.com/fogok.png" width="400"/>
<img src="https://i.stack.imgur.com/a8Am8.png" width="400"/>
<p>The candidates (the query result, here with <code>p=2</code>) are:</p>
<pre class="lang-py prettyprint-override"><code>>>> kd_cand = kd_points.query_ball_tree(kd_boxes, r * 1.001, p=2)
>>> kd_cand
[[], [], [9], [], [], [10], [], [], [], [], [], [9], [10], [], [11], [11], [], [2, 6], [8], [8], [], [4], [7, 9], [4], [3, 5], [2, 4], [0], [], [7, 9], [7, 9], [0, 3, 5], [4]]
</code></pre>
<p>Since this is a jagged list, we convert it to a rectangular array padded with <code>fill_value=-1</code>:</p>
<pre class="lang-py prettyprint-override"><code>def pad_jagged(a, fill_value=-1):
    lens = np.array([len(r) for r in a])
    v = np.array([e for r in a for e in r])
    w = np.max(lens)
    return np.insert(
        v,
        np.repeat(np.cumsum(lens), w - lens),
        fill_value,
    ).reshape(-1, w)
</code></pre>
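<p>For example (repeating the helper so the snippet runs on its own), a small jagged input pads out on the right to the longest row's length:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

def pad_jagged(a, fill_value=-1):
    # same helper as above, repeated for a self-contained demo
    lens = np.array([len(r) for r in a])
    v = np.array([e for r in a for e in r])
    w = np.max(lens)
    return np.insert(v, np.repeat(np.cumsum(lens), w - lens),
                     fill_value).reshape(-1, w)

print(pad_jagged([[1], [2, 3], []]))
# [[ 1 -1]
#  [ 2  3]
#  [-1 -1]]
</code></pre>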
<p>For the candidate list above, this gives:</p>
<pre class="lang-py prettyprint-override"><code>>>> pad_jagged(kd_cand)
[[-1 -1 -1]
[ 8 -1 -1]
[ 9 -1 -1]
...
[ 7 9 -1]
[ 0 3 5]
[ 4 -1 -1]]
</code></pre>
<p>We now iterate over the columns of this array, greedily removing any successful match from the previous iterations (which is what <code>remaining</code> is for).</p>
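<p>The mechanics of that loop can be seen on a toy padded array, with a made-up accept list standing in for the actual containment test:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

cand = np.array([[ 0, -1, -1],    # point 0: one candidate
                 [ 2,  3, -1],    # point 1: two candidates
                 [-1, -1, -1]])   # point 2: none
accept = np.array([0, 3])         # hypothetical boxes that "contain" their point
found = np.full(len(cand), -1)
remaining = np.arange(len(cand))
for c in range(cand.shape[1]):
    remaining = remaining[cand[remaining, c] >= 0]  # drop exhausted rows
    if remaining.size == 0:
        break
    j = cand[remaining, c]
    ok = np.isin(j, accept)
    found[remaining[ok]] = j[ok]  # record the first successful match
    remaining = remaining[~ok]    # matched rows leave the loop for good
print(found)  # [ 0  3 -1]
</code></pre>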
<p>The other functions, handling the preprocessing etc., are:</p>
<pre class="lang-py prettyprint-override"><code>def find_regions(xy, bb, p=2):
    # xy: numpy array (n, 2)
    # bb: numpy array (m, 4): the four columns are xmin, xmax, ymin, ymax
    # for each point in xy, return the index of a region that contains it, or m if none
    centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
    assert p in {1, 2}, "only 1- or 2-norm supported"
    if p == 1:
        radius = bb @ np.array([-1,1,-1,1]) / 2
    else:
        radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
    kd_boxes = KDTree(centers)
    kd_points = KDTree(xy)
    return find(kd_boxes, kd_points, bb, radius.max(), p=p)

def find_region_label(df1, df2, nomatch=None, p=2):
    found = find_regions(
        df1[['longitude', 'latitude']].values,
        df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values,
        p=p,
    )
    # append the nomatch label at index m, where "no match" points land
    lbl = np.r_[df2['STRING'].values, [nomatch]]
    return lbl[found]
</code></pre>
<p>On the OP's example:</p>
<pre><code>>>> df1.assign(LABEL=find_region_label(df1, df2))
latitude longitude LABEL
0 1.3 2.7 AAA
1 3.5 3.6 CCC
2 2.8 3.0 None
3 9.7 1.9 None
4 6.2 5.7 None
5 1.7 3.4 None
6 3.5 1.4 BBB
7 2.7 6.6 None
8 1.7 2.7 AAA
9 1.3 1.3 AAA
</code></pre>
<p><strong>Speed test</strong></p>
<p>Now for a speed test, where both the number of points <code>n</code> <em>and</em> the number of bounding boxes <code>m</code> are large:</p>
<pre class="lang-py prettyprint-override"><code>df1, df2 = gen_example(100_000, 10_000, 0)
</code></pre>
<p>Speed comparison with <a href="https://stackoverflow.com/a/68926522/758174">@norok2's numba solution</a> (slightly adapted to follow the OP's column names):</p>
<pre class="lang-py prettyprint-override"><code>a = %timeit -o find_region_label(df1, df2)
# 222 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
b = %timeit -o locate_in_regions_nb(df1, df2)
# 5.38 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> b.average / a.average
24.255
</code></pre>
<p><strong>That is a 24x speedup on this data.</strong></p>
<p>Verifying that we get the same solution:</p>
<pre class="lang-py prettyprint-override"><code>sa = find_region_label(df1, df2)
sb = locate_in_regions_nb(df1, df2)
>>> (sa == sb).all()
True
</code></pre>