Optimizing a loop in pandas

I have a dataframe in which two columns hold the coordinates of points. If a point falls inside a particular region, I need to fill a column (initially all None) with a specific value. The regions and their labels are stored in another dataframe. It's not easy to explain, but I hope an example makes it clear:

DF1

<pre><code>   latitude  longitude LABEL
0       1.3        2.7  None
1       3.5        3.6  None
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4  None
7       2.7        6.6  None
8       1.7        2.7  None
9       1.3        1.3  None
</code></pre>

DF2

<pre><code>   minlat  maxlat  minlong  maxlong STRING
0     1.0     2.0      1.0      3.0    AAA
1     3.0     4.0      1.0      2.0    BBB
2     3.0     4.0      3.0      4.0    CCC
3     5.0     7.0      2.0      3.0    DDD
</code></pre>

The final result should be:

<pre><code>   latitude  longitude LABEL
0       1.3        2.7   AAA
1       3.5        3.6   CCC
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4   BBB
7       2.7        6.6  None
8       1.7        2.7   AAA
9       1.3        1.3   AAA
</code></pre>

My current code is:

<pre><code>for i in range(len(DF2)):
    DF1.loc[(DF1['latitude'] >= DF2.loc[i:i,'minlat'].at[i])
            & (DF1['latitude'] < DF2.loc[i:i,'maxlat'].at[i])
            & (DF1['longitude'] >= DF2.loc[i:i,'minlong'].at[i])
            & (DF1['longitude'] < DF2.loc[i:i,'maxlong'].at[i]),
            'LABEL'] = DF2.loc[i:i,'STRING'].at[i]
</code></pre>

Screenshot for better indentation: <a href="https://i.stack.imgur.com/pF3xM.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pF3xM.png" alt="enter image description here"/></a>

So, for each row of DF2, I check which values of DF1 fall within its bounds and assign the corresponding string. But this takes a lot of time. Do you have any suggestions for what I could do? My problem is that every value of DF1 has to be checked against every row of DF2, not only against the row with the same index.

Edit: I am trying other approaches:

<pre><code>for i in range(len(DF2)):
    minLat = DF2.loc[i:i,'minlat'].at[i]
    maxLat = DF2.loc[i:i,'maxlat'].at[i]
    minLong = DF2.loc[i:i,'minlong'].at[i]
    maxLong = DF2.loc[i:i,'maxlong'].at[i]
    DF1.loc[(DF1['latitude'] >= minLat) & (DF1['latitude'] < maxLat)
            & (DF1['longitude'] >= minLong) & (DF1['longitude'] < maxLong),
            'LABEL'] = DF2.loc[i:i,'STRING'].at[i]
</code></pre>

This feels slower locally, but faster when I try it on the machine.

And:

<pre><code>for i in range(len(DF2)):
    minLat = DF2.loc[i:i,'minlat'].at[i]
    maxLat = DF2.loc[i:i,'maxlat'].at[i]
    minLong = DF2.loc[i:i,'minlong'].at[i]
    maxLong = DF2.loc[i:i,'maxlong'].at[i]
    # note: LABEL is rebuilt on every pass, so matches from earlier rows are reset to None
    DF1 = DF1.assign(
        LABEL=np.select(
            [(DF1['latitude'] >= minLat) & (DF1['latitude'] < maxLat)
             & (DF1['longitude'] >= minLong) & (DF1['longitude'] < maxLong)],
            [DF2.loc[i:i,'STRING'].at[i]],
            [None]))
</code></pre>

This feels faster locally, but slower on the machine.
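For anyone who wants to reproduce the problem, the sample frames and the row-by-row loop described above can be sketched as follows. This is only a minimal baseline under stated assumptions: the function name `label_with_loop` is hypothetical, the frames are rebuilt by hand from the tables in the question, and the bounds are treated as half-open (`>=` minimum, `<` maximum) as in the question's code.

```python
import pandas as pd

# Sample data rebuilt by hand from the tables in the question.
DF1 = pd.DataFrame({
    "latitude":  [1.3, 3.5, 2.8, 9.7, 6.2, 1.7, 3.5, 2.7, 1.7, 1.3],
    "longitude": [2.7, 3.6, 3.0, 1.9, 5.7, 3.4, 1.4, 6.6, 2.7, 1.3],
    "LABEL":     [None] * 10,
})
DF2 = pd.DataFrame({
    "minlat":  [1.0, 3.0, 3.0, 5.0],
    "maxlat":  [2.0, 4.0, 4.0, 7.0],
    "minlong": [1.0, 1.0, 3.0, 2.0],
    "maxlong": [3.0, 2.0, 4.0, 3.0],
    "STRING":  ["AAA", "BBB", "CCC", "DDD"],
})


def label_with_loop(points, boxes):
    """Hypothetical baseline: one vectorized comparison per row of `boxes`."""
    out = points.copy()
    for box in boxes.itertuples(index=False):
        # Half-open bounds: min <= value < max, as in the question.
        inside = (
            (out["latitude"] >= box.minlat) & (out["latitude"] < box.maxlat)
            & (out["longitude"] >= box.minlong) & (out["longitude"] < box.maxlong)
        )
        out.loc[inside, "LABEL"] = box.STRING
    return out


result = label_with_loop(DF1, DF2)
print(result)
```

The loop still runs once per row of `DF2`, so it mainly serves as a reference result to compare faster approaches against.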

1 Answer

Anonymous, 1 day ago

One way to vectorize this operation is to use NumPy and its broadcasting. This gives a fast solution for small to medium dataframes, but the <code>mask</code> grows as <code>O(n*m)</code> in both time and memory (for <code>n</code> rows in <code>df1</code> and <code>m</code> rows in <code>df2</code>), so it eventually becomes slow for large dataframes.

<pre class="lang-py prettyprint-override"><code>import numpy as np

a = df1[['latitude', 'longitude']].values
vmin = df2[['minlat', 'minlong']].values
vmax = df2[['maxlat', 'maxlong']].values

# mask[i, j] is True when point i lies inside bounding box j
mask = (vmin[None, :, :] <= a[:, None, :]).all(2) & (a[:, None, :] <= vmax[None, :, :]).all(2)
has_any = mask.any(1)
first = mask.argmax(axis=1)
label = np.full(len(df1), None, dtype=object)
label[has_any] = df2['STRING'].values[first[has_any]]

>>> df1.assign(LABEL=label)
   latitude  longitude LABEL
0       1.3        2.7   AAA
1       3.5        3.6   CCC
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4   BBB
7       2.7        6.6  None
8       1.7        2.7   AAA
9       1.3        1.3   AAA
</code></pre>

Explanation

The key part is the construction of <code>mask</code>. It is worth breaking it down to understand the mechanism and how it uses NumPy's broadcasting:

<pre class="lang-py prettyprint-override"><code>>>> vmin[None, :, :] <= a[:, None, :]
[[[ True  True]
  [False  True]
  [False False]
  [False  True]]

 [[ True  True]
  [ True  True]
  [ True  True]
  [False  True]]

 ...

 [[ True  True]
  [False  True]
  [False False]
  [False False]]]
</code></pre>

As you can see, the above expands all comparisons between <code>a</code> and <code>vmin</code> into a third dimension. We then project back to 2D with the logic "all entries along the third axis (latitude and longitude) must be True":

<pre class="lang-py prettyprint-override"><code>>>> (vmin[None, :, :] <= a[:, None, :]).all(2)
[[ True False False False]
 [ True  True  True False]
 [ True False False False]
 [ True  True False False]
 [ True  True  True  True]
 [ True False False False]
 [ True  True False False]
 [ True False False False]
 [ True False False False]
 [ True False False False]]
</code></pre>

In the result above, entry <code>...[i, j]</code> is True when the point <code>df1.iloc[i]</code> is at or above the minimum corner of <code>df2.iloc[j]</code> in both coordinates.

We do the same for <code>vmax</code>, and the resulting <code>mask</code> is True wherever the point <code>df1.iloc[i]</code> lies inside the bounding box <code>df2.iloc[j]</code>.

The next two bits are <code>has_any</code> and <code>first</code>. The former indicates which points in <code>df1</code> are in at least one bounding box. The latter is the first such bounding box (as an index into <code>df2</code>).

The rest is self-explanatory.

Notes

Note that this uses <code>O(n*m)</code> comparisons (for <code>n</code> rows in <code>df1</code> and <code>m</code> rows in <code>df2</code>), which may be too slow for large matrices (although, because it is vectorized, it is very fast for medium-sized ones).

For large matrices, better approaches involve sorting or using a KD-tree. See <a href="https://stackoverflow.com/a/68927974/758174">this other answer</a>.
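As a self-contained sketch of the broadcasting approach described in this answer, the snippet below wraps it in a function and runs it on the question's sample data. The function name `label_points` and the `chunk_size` parameter are illustrative assumptions, not part of pandas or NumPy. Processing `df1` in chunks is an addition to the answer's code, meant only to keep the intermediate boolean arrays bounded by `chunk_size` rather than by the full length of `df1`; the total number of comparisons is unchanged. Bounds are inclusive on both sides, as in the answer.

```python
import numpy as np
import pandas as pd


def label_points(points, boxes, chunk_size=100_000):
    """Hypothetical helper: label of the first box containing each point, else None."""
    label = np.full(len(points), None, dtype=object)
    if len(boxes) == 0:
        return label  # argmax would fail on an empty box axis

    a = points[["latitude", "longitude"]].to_numpy()
    vmin = boxes[["minlat", "minlong"]].to_numpy()
    vmax = boxes[["maxlat", "maxlong"]].to_numpy()
    strings = boxes["STRING"].to_numpy(dtype=object)

    for start in range(0, len(a), chunk_size):
        stop = start + chunk_size
        chunk = a[start:stop, None, :]  # shape (c, 1, 2)
        # mask[i, j] is True when point i of the chunk lies inside box j
        mask = (vmin[None, :, :] <= chunk).all(2) & (chunk <= vmax[None, :, :]).all(2)
        has_any = mask.any(1)        # point is in at least one box
        first = mask.argmax(axis=1)  # index of the first matching box
        out = label[start:stop]      # a view, so writes land in `label`
        out[has_any] = strings[first[has_any]]
    return label


df1 = pd.DataFrame({
    "latitude":  [1.3, 3.5, 2.8, 9.7, 6.2, 1.7, 3.5, 2.7, 1.7, 1.3],
    "longitude": [2.7, 3.6, 3.0, 1.9, 5.7, 3.4, 1.4, 6.6, 2.7, 1.3],
})
df2 = pd.DataFrame({
    "minlat":  [1.0, 3.0, 3.0, 5.0],
    "maxlat":  [2.0, 4.0, 4.0, 7.0],
    "minlong": [1.0, 1.0, 3.0, 2.0],
    "maxlong": [3.0, 2.0, 4.0, 3.0],
    "STRING":  ["AAA", "BBB", "CCC", "DDD"],
})

print(df1.assign(LABEL=label_points(df1, df2)))
```

A smaller `chunk_size` lowers peak memory at the cost of more Python-level iterations; for very large inputs the sorting or KD-tree approaches mentioned above are likely still the better choice.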
