Answers to the question: Optimizing a loop in pandas

Optimizing a loop in pandas


I have a DataFrame in which two columns hold the coordinates of points. If a point lies within a given region, I need to fill a column (initially all None) with a specific value. The regions and their labels are stored in another DataFrame.

This is not easy to explain, but I hope an example makes it clear:

DF1

<pre><code>   latitude  longitude LABEL
0       1.3        2.7  None
1       3.5        3.6  None
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4  None
7       2.7        6.6  None
8       1.7        2.7  None
9       1.3        1.3  None
</code></pre>

DF2

<pre><code>   minlat  maxlat  minlong  maxlong STRING
0     1.0     2.0      1.0      3.0    AAA
1     3.0     4.0      1.0      2.0    BBB
2     3.0     4.0      3.0      4.0    CCC
3     5.0     7.0      2.0      3.0    DDD
</code></pre>

The final result is:

<pre><code>   latitude  longitude LABEL
0       1.3        2.7   AAA
1       3.5        3.6   CCC
2       2.8        3.0  None
3       9.7        1.9  None
4       6.2        5.7  None
5       1.7        3.4  None
6       3.5        1.4   BBB
7       2.7        6.6  None
8       1.7        2.7   AAA
9       1.3        1.3   AAA
</code></pre>

The current code is:

<pre><code>for i in range(len(df2)-1):
    DF1.loc[(DF1['latitude']>=DF2.loc[i:i,'minlat'].at[i]) &
            (DF1['latitude']<DF2.loc[i:i,'maxlat'].at[i]) &
            (DF1['longitude']>=DF2.loc[i:i,'minlong'].at[i]) &
            (DF1['longitude']<DF2.loc[i:i,'maxlong'].at[i]),'LABEL'] = DF2.loc[i:i,'STRING'].at[i]
</code></pre>

Screenshot for better indentation: <a href="https://i.stack.imgur.com/pF3xM.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pF3xM.png" alt="enter image description here"/></a>

So, for each row of DF2, I check whether the values of DF1 fall within its bounds and assign the string. Done this way, however, it takes a lot of time. Do you have any suggestions on what I could do? My problem is that every value of DF1 has to be checked against every row of DF2, not only against the row with the same index.

EDIT: I am trying other approaches:

<pre><code>for i in range(len(xlsx_fact_maneuver_specialareas)-1):
    minLat=DF2.loc[i:i,'minLat'].at[i]
    maxLat=DF2.loc[i:i,'maxLat'].at[i]
    minLong=DF2.loc[i:i,'maxLat'].at[i]
    maxLong=DF2.loc[i:i,'maxLong'].at[i]
    DF1.loc[(DF1['latitude']>=minLat) &
            (DF1['latitude']<maxLat) &
            (DF1['longitude']>=minLong) &
            (DF1['longitude']<maxLong),'LABEL'] = DF2.loc[i:i,'STRING'].at[i]
</code></pre>

This one performs worse for me locally, but better when I try it on the machine.

And:

<pre><code>for i in range(len(xlsx_fact_maneuver_specialareas)-1):
    minLat=DF2.loc[i:i,'minLat'].at[i]
    maxLat=DF2.loc[i:i,'maxLat'].at[i]
    minLong=DF2.loc[i:i,'maxLat'].at[i]
    maxLong=DF2.loc[i:i,'maxLong'].at[i]
    DF1 = DF1.assign(
        label = np.select(
            [(DF1['latitude']>=minLat) &
             (DF1['latitude']<maxLat) &
             (DF1['longitude']>=minLong) &
             (DF1['longitude']<maxLong)],
            [DF2.loc[i:i,'STRING'].at[i]],
            [None]))
</code></pre>

This one performs better for me locally, but worse on the machine.
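For reference, a minimal sketch of the loop described above, written so that it visits every row of DF2. It assumes the column names shown in the DF1 and DF2 tables. Note that `range(len(...)-1)` in the snippets above stops one row early, and that the two later snippets read `minLong` from the `maxLat` column; the sketch avoids both. It is still a row-by-row loop and makes no claim about speed.

```python
import pandas as pd

# Example data copied from the DF1 / DF2 tables in the question.
DF1 = pd.DataFrame({
    'latitude':  [1.3, 3.5, 2.8, 9.7, 6.2, 1.7, 3.5, 2.7, 1.7, 1.3],
    'longitude': [2.7, 3.6, 3.0, 1.9, 5.7, 3.4, 1.4, 6.6, 2.7, 1.3],
})
DF1['LABEL'] = pd.Series([None] * len(DF1), dtype=object)

DF2 = pd.DataFrame({
    'minlat':  [1.0, 3.0, 3.0, 5.0],
    'maxlat':  [2.0, 4.0, 4.0, 7.0],
    'minlong': [1.0, 1.0, 3.0, 2.0],
    'maxlong': [3.0, 2.0, 4.0, 3.0],
    'STRING':  ['AAA', 'BBB', 'CCC', 'DDD'],
})

# One vectorised comparison per region; itertuples yields every row of DF2.
for row in DF2.itertuples(index=False):
    inside = (
        (DF1['latitude'] >= row.minlat) & (DF1['latitude'] < row.maxlat)
        & (DF1['longitude'] >= row.minlong) & (DF1['longitude'] < row.maxlong)
    )
    DF1.loc[inside, 'LABEL'] = row.STRING

print(DF1)
```

Reading the bounds from the named tuple once per region replaces the repeated `DF2.loc[i:i, ...].at[i]` lookups.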

Category: Python Q&A


1 Answer

Anonymous, 1 day ago


This problem is not particularly well suited to being solved within pandas itself, because there is no simple primitive for the computation you need. A better approach is to move to the NumPy or Numba domain and solve the problem at a lower level.

I will provide functions that generate the last column, on the assumption that copying it into the DataFrame is relatively easy.

The original approach reads:

<pre><code>import numpy as np


def locate_in_regions_OP(points, regions):
    result = np.full(len(points), None, dtype=object)
    # As in the question's code, `len(regions) - 1` leaves out the last region.
    for i in range(len(regions) - 1):
        result[
            (points['lat'] >= regions.loc[i:i, 'min_lat'].at[i])
            & (points['lat'] < regions.loc[i:i,'max_lat'].at[i])
            & (points['lon'] >= regions.loc[i:i, 'min_lon'].at[i])
            & (points['lon'] < regions.loc[i:i, 'max_lon'].at[i])
        ] = regions.loc[i:i, 'lbl'].at[i]
    return result
</code></pre>

This produces the correct result for the last column. (The other approaches proposed in the OP are either not relevant, only introduce separate assignments for quantities that are used once, or I did not manage to get them to work.)

A relatively simple approach involves broadcasting. It is presented in <a href="https://stackoverflow.com/a/68923402/5218354">@PierreD's answer</a> and can be further simplified to:

<pre><code>import numpy as np


def locate_in_regions_bc(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    # mask[i, j] is True when point i lies inside region j
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    has_any = np.any(mask, axis=1)
    first = np.argmax(mask, axis=1)  # index of the first matching region
    result = np.full(len(points), None, dtype=object)
    result[has_any] = regions.loc[first[has_any], 'lbl'].values
    return result
</code></pre>

Under the assumption that each position belongs to at most one region, this can be simplified slightly:

<pre><code>import numpy as np


def locate_in_regions_bcu(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    # (point index, region index) pairs of all matches
    pos, lbl = np.where(mask)
    result = np.full(len(points), None, dtype=object)
    result[pos] = regions.loc[lbl, 'lbl'].values
    return result
</code></pre>

However, a lot of unnecessary memory allocation and comparisons are going on here. A faster approach is to write explicit loops with Numba, where you can add the short-circuiting explicitly. The code reads:

<pre><code>import numpy as np
import numba as nb


def locate_in_regions_nb(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    found_arr = _locate_in_regions_nb(pos_arr, min_arr, max_arr)
    mask = found_arr >= 0  # -1 marks points that fall in no region
    result = np.full(len(points), None, dtype=object)
    result[mask] = regions.loc[found_arr[mask], 'lbl'].values
    return result


@nb.jit
def _locate_in_regions_nb(pos_arr, min_arr, max_arr):
    n, l = pos_arr.shape
    m = min_arr.shape[0]
    found_arr = np.full((n,), -1)
    for i in range(n):
        for j in range(m):
            contained = True
            for k in range(l):
                if min_arr[j, k] > pos_arr[i, k] or pos_arr[i, k] >= max_arr[j, k]:
                    contained = False
                    break  # one coordinate is outside: skip this region
            if contained:
                found_arr[i] = j
                break  # first matching region wins
    return found_arr
</code></pre>

Using slightly cleaner but otherwise comparable input:

<pre><code>import numpy as np
import pandas as pd
import numba as nb


df1 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=[
        [1.3, 2.7, None],
        [3.5, 3.6, None],
        [2.8, 3.0, None],
        [9.7, 1.9, None],
        [1.7, 3.4, None],
        [3.5, 1.4, None],
        [2.7, 6.6, None],
        [1.7, 2.7, None],
        [1.3, 1.3, None],
    ])
df2 = pd.DataFrame(
    columns=['min_lat', 'max_lat', 'min_lon', 'max_lon', 'lbl'],
    data=[
        [1.0, 2.0, 1.0, 3.0, 'AAA'],
        [3.0, 4.0, 1.0, 2.0, 'BBB'],
        [3.0, 4.0, 3.0, 4.0, 'CCC'],
        [5.0, 7.0, 2.0, 3.0, 'DDD'],
    ])

print(df1)
#    lat  lon   lbl
# 0  1.3  2.7  None
# 1  3.5  3.6  None
# 2  2.8  3.0  None
# 3  9.7  1.9  None
# 4  1.7  3.4  None
# 5  3.5  1.4  None
# 6  2.7  6.6  None
# 7  1.7  2.7  None
# 8  1.3  1.3  None
print(df2)
#    min_lat  max_lat  min_lon  max_lon  lbl
# 0      1.0      2.0      1.0      3.0  AAA
# 1      3.0      4.0      1.0      2.0  BBB
# 2      3.0      4.0      3.0      4.0  CCC
# 3      5.0      7.0      2.0      3.0  DDD
</code></pre>

The expected output is obtained in all cases:

<pre><code>print(locate_in_regions_OP(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bc(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bcu(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_nb(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
</code></pre>

While it is not easy to produce arbitrarily large inputs that are representative of the problem, naive timings indicate that the Numba approach is by far the fastest:

<pre><code>np.random.seed(0)
df3 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=np.random.random((1000000, 3)) * 5)
df3.loc[:, 'lbl'] = None


%timeit locate_in_regions_OP(df3, df2)
# 209 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit locate_in_regions_bc(df3, df2)
# 161 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_bcu(df3, df2)
# 115 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_nb(df3, df2)
# 66.6 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
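The broadcasting variants build a boolean mask of shape (number of points, number of regions) in one go, which is where much of the extra memory goes. If Numba is not an option, one possibility is to apply the same first-match logic to slices of the points so that the mask stays small. The following is only a sketch under that assumption: `locate_in_regions_chunked` and its `chunk_size` parameter are hypothetical names that do not appear in the answer above, the column names follow the cleaned-up input, and no timing is claimed for it. The last lines show one way to copy the result back into the DataFrame.

```python
import numpy as np
import pandas as pd


def locate_in_regions_chunked(points, regions, chunk_size=100_000):
    """First-match region lookup by broadcasting over slices of `points`.

    `chunk_size` (hypothetical parameter) bounds the temporary mask to
    chunk_size x len(regions) booleans.
    """
    pos_arr = points[['lat', 'lon']].to_numpy()
    min_arr = regions[['min_lat', 'min_lon']].to_numpy()
    max_arr = regions[['max_lat', 'max_lon']].to_numpy()
    labels = regions['lbl'].to_numpy()  # positional lookup, index-independent
    result = np.full(len(points), None, dtype=object)
    for start in range(0, len(points), chunk_size):
        chunk = pos_arr[start:start + chunk_size, None, :]
        mask = (
            np.all(min_arr[None, :, :] <= chunk, axis=2)
            & np.all(chunk < max_arr[None, :, :], axis=2))
        has_any = mask.any(axis=1)
        first = mask.argmax(axis=1)  # first matching region per point
        # The basic slice is a view, so this writes into `result` itself.
        result[start:start + chunk_size][has_any] = labels[first[has_any]]
    return result


df1 = pd.DataFrame({
    'lat': [1.3, 3.5, 2.8, 9.7, 1.7, 3.5, 2.7, 1.7, 1.3],
    'lon': [2.7, 3.6, 3.0, 1.9, 3.4, 1.4, 6.6, 2.7, 1.3],
})
df2 = pd.DataFrame({
    'min_lat': [1.0, 3.0, 3.0, 5.0],
    'max_lat': [2.0, 4.0, 4.0, 7.0],
    'min_lon': [1.0, 1.0, 3.0, 2.0],
    'max_lon': [3.0, 2.0, 4.0, 3.0],
    'lbl': ['AAA', 'BBB', 'CCC', 'DDD'],
})

# A tiny chunk size is used here only to exercise the slicing.
labels_out = locate_in_regions_chunked(df1, df2, chunk_size=4)
df1['lbl'] = labels_out  # copy the computed column back into the DataFrame
print(df1)
```

Because labels are looked up by position rather than with `regions.loc`, this sketch does not rely on `regions` having a default integer index.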
