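<p>This problem is not particularly well suited to being solved within Pandas itself, because there are no simple primitives for the computation you need.
A better approach is to move to the NumPy or Numba domain and solve the problem at a lower level.</p>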
<p>I will provide functions that generate the last column, assuming that copying that column into the DataFrame is relatively easy.</p>
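<p>As a minimal sketch of that last step, the returned array can be attached with a plain column assignment. The labels array here is a hypothetical stand-in for the output of any of the locate_in_regions_* functions, and the two-row DataFrame is made up for illustration:</p>
<pre><code>import numpy as np
import pandas as pd

points = pd.DataFrame({'lat': [1.3, 9.7], 'lon': [2.7, 1.9]})
# Hypothetical labels, shaped like the object arrays the functions return
labels = np.array(['AAA', None], dtype=object)
# Positional assignment: one label per row, in row order
points['lbl'] = labels
print(points['lbl'].tolist())
# ['AAA', None]
</code></pre>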
<p>The original approach is:</p>
<pre><code>import numpy as np


def locate_in_regions_OP(points, regions):
    result = np.full(len(points), None, dtype=object)
    # NOTE: the loop bound is kept as in the question; range(len(regions) - 1)
    # never visits the last region (harmless for the sample data used here,
    # where no point falls inside it).
    for i in range(len(regions) - 1):
        result[
            (points['lat'] >= regions.loc[i:i, 'min_lat'].at[i])
            & (points['lat'] < regions.loc[i:i, 'max_lat'].at[i])
            & (points['lon'] >= regions.loc[i:i, 'min_lon'].at[i])
            & (points['lon'] < regions.loc[i:i, 'max_lon'].at[i])
        ] = regions.loc[i:i, 'lbl'].at[i]
    return result
</code></pre>
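<p>This produces the correct result for the last column.
(The other approaches proposed in the OP are either irrelevant, only use standalone assignments for quantities that are used just once, or I did not manage to get them to work.)</p>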
<p>A relatively simple approach involves broadcasting, as introduced in <a href="https://stackoverflow.com/a/68923402/5218354">@PierreD's answer</a>, and can be further simplified to:</p>
<pre><code>import numpy as np


def locate_in_regions_bc(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    # mask[i, j] is True when point i lies inside region j
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    has_any = np.any(mask, axis=1)
    # index of the first matching region (0 if there is none, hence has_any)
    first = np.argmax(mask, axis=1)
    result = np.full(len(points), None, dtype=object)
    result[has_any] = regions.loc[first[has_any], 'lbl'].values
    return result
</code></pre>
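<p>To illustrate the shapes involved in the broadcasting step, and why has_any is needed next to np.argmax, here is a small sketch on made-up data (two points, three regions). On a row with no True value, np.argmax still returns 0, so it cannot by itself distinguish "first region" from "no region":</p>
<pre><code>import numpy as np

pos_arr = np.array([[1.3, 2.7], [9.7, 1.9]])              # 2 points
min_arr = np.array([[1.0, 1.0], [3.0, 1.0], [3.0, 3.0]])  # 3 regions
max_arr = np.array([[2.0, 3.0], [4.0, 2.0], [4.0, 4.0]])

# Broadcasting (1, m, 2) against (n, 1, 2) gives one test per point, region and axis
inside = (
    (min_arr[None, :, :] <= pos_arr[:, None, :])
    & (pos_arr[:, None, :] < max_arr[None, :, :]))
print(inside.shape)
# (2, 3, 2)

# A point is in a region only if both its lat and its lon are in range
mask = inside.all(axis=2)
print(mask)
# [[ True False False]
#  [False False False]]

print(np.argmax(mask, axis=1))  # 0 for the second point, although nothing matched
# [0 0]
print(np.any(mask, axis=1))
# [ True False]
</code></pre>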
<p>Under the assumption that each location belongs to only one region, this can be simplified slightly:</p>
<pre><code>import numpy as np


def locate_in_regions_bcu(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    # (point index, region index) pairs for every match
    pos, lbl = np.where(mask)
    result = np.full(len(points), None, dtype=object)
    result[pos] = regions.loc[lbl, 'lbl'].values
    return result
</code></pre>
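<p>Whether that uniqueness assumption holds for a given dataset can be checked by counting the matches per point. The following is only a sketch: count_matches is a hypothetical helper, and the sample arrays are made up so that the first two regions deliberately overlap:</p>
<pre><code>import numpy as np


def count_matches(pos_arr, min_arr, max_arr):
    # Hypothetical helper: number of regions containing each point
    mask = (
        np.all(min_arr[None, :, :] <= pos_arr[:, None, :], axis=2)
        & np.all(pos_arr[:, None, :] < max_arr[None, :, :], axis=2))
    return mask.sum(axis=1)


pos_arr = np.array([[1.5, 1.5], [3.5, 3.5], [9.0, 9.0]])
min_arr = np.array([[1.0, 1.0], [1.0, 1.0], [3.0, 3.0]])  # regions 0 and 1 overlap
max_arr = np.array([[2.0, 2.0], [2.0, 3.0], [4.0, 4.0]])

counts = count_matches(pos_arr, min_arr, max_arr)
print(counts)
# [2 1 0]
print(bool((counts <= 1).all()))  # False: the assumption is violated here
# False
</code></pre>
<p>When a count exceeds 1, the np.where-based version no longer picks the first matching region, so the np.argmax-based version is the safer choice in that case.</p>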
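<p>However, a lot of unnecessary memory allocation and comparison is going on.
A faster approach is to use explicit loops with Numba, where you can add short-circuiting explicitly. The code reads:</p>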
<pre><code>import numpy as np
import numba as nb


def locate_in_regions_nb(points, regions):
    pos_arr = points[['lat', 'lon']].values
    min_arr = regions[['min_lat', 'min_lon']].values
    max_arr = regions[['max_lat', 'max_lon']].values
    found_arr = _locate_in_regions_nb(pos_arr, min_arr, max_arr)
    # -1 marks points that are not inside any region
    mask = found_arr >= 0
    result = np.full(len(points), None, dtype=object)
    result[mask] = regions.loc[found_arr[mask], 'lbl'].values
    return result


@nb.jit
def _locate_in_regions_nb(pos_arr, min_arr, max_arr):
    n, l = pos_arr.shape
    m = min_arr.shape[0]
    found_arr = np.full((n,), -1)
    for i in range(n):
        for j in range(m):
            contained = True
            for k in range(l):
                # stop checking this region at the first out-of-range coordinate
                if min_arr[j, k] > pos_arr[i, k] or pos_arr[i, k] >= max_arr[j, k]:
                    contained = False
                    break
            if contained:
                # first matching region wins; skip the remaining regions
                found_arr[i] = j
                break
    return found_arr
</code></pre>
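<p>For comparison, the broadcasting versions materialize boolean intermediates of shape (n, m, 2) for each of the two comparisons before reducing them, which the explicit loops avoid entirely. A rough sketch of that cost, with arbitrary sizes chosen only for illustration:</p>
<pre><code>import numpy as np

n, m = 1000, 4  # arbitrary numbers of points and regions
pos_arr = np.random.default_rng(0).random((n, 2)) * 5
min_arr = np.zeros((m, 2))

# One of the two comparisons performed by the broadcasting approach
lower_ok = min_arr[None, :, :] <= pos_arr[:, None, :]
print(lower_ok.shape, lower_ok.nbytes)  # one byte per boolean element
# (1000, 4, 2) 8000
</code></pre>
<p>The intermediate grows with the product of the number of points and the number of regions, whereas the Numba loop only allocates the output array of n integers.</p>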
<p>Using slightly cleaner but otherwise comparable input:</p>
<pre><code>import numpy as np
import pandas as pd
import numba as nb
df1 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=[
        [1.3, 2.7, None],
        [3.5, 3.6, None],
        [2.8, 3.0, None],
        [9.7, 1.9, None],
        [1.7, 3.4, None],
        [3.5, 1.4, None],
        [2.7, 6.6, None],
        [1.7, 2.7, None],
        [1.3, 1.3, None],
    ])
df2 = pd.DataFrame(
    columns=['min_lat', 'max_lat', 'min_lon', 'max_lon', 'lbl'],
    data=[
        [1.0, 2.0, 1.0, 3.0, 'AAA'],
        [3.0, 4.0, 1.0, 2.0, 'BBB'],
        [3.0, 4.0, 3.0, 4.0, 'CCC'],
        [5.0, 7.0, 2.0, 3.0, 'DDD'],
    ])
print(df1)
#    lat  lon   lbl
# 0  1.3  2.7  None
# 1  3.5  3.6  None
# 2  2.8  3.0  None
# 3  9.7  1.9  None
# 4  1.7  3.4  None
# 5  3.5  1.4  None
# 6  2.7  6.6  None
# 7  1.7  2.7  None
# 8  1.3  1.3  None
print(df2)
#    min_lat  max_lat  min_lon  max_lon  lbl
# 0      1.0      2.0      1.0      3.0  AAA
# 1      3.0      4.0      1.0      2.0  BBB
# 2      3.0      4.0      3.0      4.0  CCC
# 3      5.0      7.0      2.0      3.0  DDD
</code></pre>
<p>The expected output is obtained in all cases:</p>
<pre><code>print(locate_in_regions_OP(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bc(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_bcu(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
print(locate_in_regions_nb(df1, df2))
# ['AAA' 'CCC' None None None 'BBB' None 'AAA' 'AAA']
</code></pre>
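<p>While it is not easy to produce arbitrarily large inputs that are representative of the problem, naive timings indicate that the Numba approach is considerably faster:</p>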
<pre><code>np.random.seed(0)
df3 = pd.DataFrame(
    columns=['lat', 'lon', 'lbl'],
    data=np.random.random((1000000, 3)) * 5)
df3.loc[:, 'lbl'] = None
%timeit locate_in_regions_OP(df3, df2)
# 209 ms ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit locate_in_regions_bc(df3, df2)
# 161 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_bcu(df3, df2)
# 115 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit locate_in_regions_nb(df3, df2)
# 66.6 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
</code></pre>
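<p>The %timeit lines above are IPython magics. Outside IPython or Jupyter, a rough equivalent can be sketched with the standard-library timeit module. In this sketch count_inside is a hypothetical stand-in workload rather than one of the functions above, and the input size is arbitrary. The warm-up call matters mostly for JIT-compiled functions, whose first call includes compilation time:</p>
<pre><code>import timeit

import numpy as np


def count_inside(pos_arr, lo, hi):
    # Hypothetical stand-in workload: count the points inside a single box
    return int(np.all((lo <= pos_arr) & (pos_arr < hi), axis=1).sum())


rng = np.random.default_rng(0)
pos_arr = rng.random((100_000, 2)) * 5
lo = np.array([1.0, 1.0])
hi = np.array([2.0, 3.0])

count_inside(pos_arr, lo, hi)  # warm-up call, excluded from the timings
# 7 repetitions of 10 calls each, mirroring the %timeit output format above
times = timeit.repeat(lambda: count_inside(pos_arr, lo, hi), repeat=7, number=10)
print(f'best: {min(times) / 10 * 1e3:.3f} ms per call')
</code></pre>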