<p>Here is another answer, this one using <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html" rel="nofollow noreferrer"><code>scipy.spatial.KDTree</code></a>. It works particularly well when the bounding boxes don't overlap too much and when their &quot;radii&quot; are not too wildly distributed (<code>radius.max() / np.median(radius)</code> not too large, say between 1 and 2).</p>
<p>It works for <em>p</em>-norm 1 (Manhattan) or 2 (Euclidean), although in practice <code>p=2</code> is faster because, on average, each point sees fewer candidates (the total area of the circles is smaller than the total area of the diamonds).</p>
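<p>That claim is easy to sanity-check: for a <em>w</em>×<em>h</em> box, the p=2 query ball is a disk whose radius is half the diagonal, while the p=1 ball is a diamond of &quot;radius&quot; <code>(w + h) / 2</code>. For boxes that are not extremely elongated, the disk is the smaller of the two (the box size here is made up for illustration):</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

# a hypothetical 3-by-2 box; each radius is half the p-norm of the diagonal
w, h = 3.0, 2.0
r2 = np.hypot(w, h) / 2          # Euclidean (p=2) radius
r1 = (w + h) / 2                 # Manhattan (p=1) radius
area_disk = np.pi * r2**2        # area scanned around each center with p=2
area_diamond = 2 * r1**2         # area of the p=1 ball (a diamond)
print(area_disk < area_diamond)  # True: p=2 sees fewer candidates here
</code></pre>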
<p>Why is it so fast? <a href="https://en.wikipedia.org/wiki/K-d_tree" rel="nofollow noreferrer">K-d trees</a> are well suited to this kind of problem: they partition space by splitting it at each node along one dimension, at the midpoint. Once built, they are fast to query, as they provide a divide-and-conquer approach.</p>
<p>The key function is the following:</p>
<pre class="lang-py prettyprint-override"><code>def find(kd_boxes, kd_points, bb, r, p):
    # bb: m bounding boxes, in the form [x0, x1, y0, y1]
    # find one bb for each point (the first one), or m if none
    xy = kd_points.data
    found = np.full(len(xy), len(bb), dtype=int)
    cand = pad_jagged(kd_points.query_ball_tree(kd_boxes, r * 1.001, p=p))
    remaining = None
    for c in range(cand.shape[1]):
        # points that still have a candidate box in this column
        remaining = (np.nonzero(cand[:, c] >= 0)[0] if remaining is None
                     else remaining[np.nonzero(cand[remaining, c] >= 0)[0]])
        if remaining.size == 0:
            break
        i = remaining
        j = cand[i, c]
        # actual containment check: is point i inside candidate box j?
        ok = (bb[j, 0::2] <= xy[i]).all(1) & (xy[i] <= bb[j, 1::2]).all(1)
        k = np.nonzero(ok)[0]
        found[i[k]] = j[k]
        remaining = remaining[~ok]
    return found
</code></pre>
<p>Before calling this function, we compute a &quot;radius&quot; for each bounding box: half the p-norm of its diagonal. We then use the overall maximum radius <code>r</code> as the maximum distance for the <code>KDTree</code> query. That query (<code>kd_points.query_ball_tree()</code>) efficiently sifts through all the bounding-box centers and finds, in a single call, all those within radius of each point. This is a superset of the actual matches, but it is fast and drastically reduces the search space. The points that are <em>actually</em> inside a bounding box are then filtered, keeping track of the (first) matching box for each point.</p>
<p>As an optimization (not implemented here), one could take the sizes of the bounding boxes (the <code>radius</code> array) into account. If they are judged too uneven, the boxes can be split into two groups (e.g. around <code>np.median(radius)</code>), and the same search done for each half (recursively again, if necessary).</p>
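<p>A minimal sketch of that idea, splitting once around the median (this is not part of the solution above; the brute-force <code>search_group</code> helper stands in for the KD-tree search just to keep the example self-contained):</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

def search_group(xy, bb):
    # stand-in for the KD-tree search: brute-force containment check,
    # returning the first matching box index per point, or -1
    inside = ((bb[None, :, 0::2] <= xy[:, None, :]).all(-1)
              & (xy[:, None, :] <= bb[None, :, 1::2]).all(-1))  # (n, m)
    return np.where(inside.any(1), inside.argmax(1), -1)

def search_split(xy, bb, radius, max_ratio=2.0):
    # if the radii are too uneven, split the boxes around the median
    # radius, search each half separately, and combine (matches in the
    # small-radius half win here, a simplification of "first box wins")
    if radius.max() / np.median(radius) <= max_ratio:
        return search_group(xy, bb)
    small = radius <= np.median(radius)
    idx_s, idx_l = np.flatnonzero(small), np.flatnonzero(~small)
    f_s = search_group(xy, bb[idx_s])
    f_l = search_group(xy, bb[idx_l])
    out = np.where(f_s >= 0, idx_s[np.maximum(f_s, 0)], -1)
    fill = (out < 0) & (f_l >= 0)
    out[fill] = idx_l[f_l[fill]]
    return out

xy = np.array([[0.5, 0.5], [2.5, 0.5], [5.0, 5.0]])
bb = np.array([[0.0, 1.0, 0.0, 1.0],     # two small boxes...
               [2.0, 3.0, 0.0, 1.0],
               [0.0, 10.0, 0.0, 10.0]])  # ...and one much larger box
radius = np.array([0.71, 0.71, 7.07])    # max/median is large: split
print(search_split(xy, bb, radius))      # [0 1 2]
</code></pre>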
<p>For the OP's example, after some preparation to get the centers, bounding boxes, and radii in a more convenient form (all NumPy arrays), the query returns:</p>
<pre class="lang-py prettyprint-override"><code># preparation
import numpy as np
from scipy.spatial import KDTree

xy = df1[['longitude', 'latitude']].values
bb = df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values
centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
kd_boxes = KDTree(centers)
kd_points = KDTree(xy)

# query
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]
</code></pre>
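<p>The two matrix products above are just compact NumPy for &quot;average the x and y bounds&quot; and &quot;half the length of the diagonal&quot;; on a single made-up box:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

bb = np.array([[2.0, 6.0, 1.0, 4.0]])  # one box [x0, x1, y0, y1]: 4 wide, 3 tall
centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
print(centers)  # [[4.  2.5]] -- ((x0 + x1) / 2, (y0 + y1) / 2)
print(radius)   # [2.5] -- half of the 3-4-5 diagonal
</code></pre>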
<p>With the Manhattan norm we get similar results (in this case, identical):</p>
<pre class="lang-py prettyprint-override"><code>radius = bb @ np.array([-1,1,-1,1]) / 2
>>> kd_points.query_ball_tree(kd_boxes, radius.max() * 1.001, p=1)
[[0], [2], [2], [], [], [], [1], [], [0], [0]]
</code></pre>
<p>Both solutions are plotted below, highlighting all the candidate bounding boxes and the actual area within distance <code>r</code> of their centers:</p>
<img src="https://i.stack.imgur.com/viQEi.png" width="400"/>
<p>Note how, in both cases, point <code>#2</code> is wrongly included as a candidate for bbox <code>#2</code>. The first step, using the KD-trees, is only a filtering step. The next step is to check whether each point's candidate <em>actually</em> contains it. With that filter (the expression used for the <code>ok</code> array), the solution is:</p>
<img src="https://i.stack.imgur.com/sgljb.png" width="400"/>
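<p>The containment test used for <code>ok</code> is plain broadcasting: with boxes stored as <code>[x0, x1, y0, y1]</code>, columns <code>0::2</code> are the lower corners <code>(x0, y0)</code> and columns <code>1::2</code> the upper corners <code>(x1, y1)</code>. On two made-up point/candidate pairs:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

bb = np.array([[0.0, 2.0, 0.0, 2.0],   # boxes as rows [x0, x1, y0, y1]
               [3.0, 5.0, 1.0, 4.0]])
xy = np.array([[1.0, 1.0],             # inside box 0
               [4.0, 0.5]])            # in box 1's x-range, below its y-range
j = np.array([0, 1])                   # candidate box for each point
ok = (bb[j, 0::2] <= xy).all(1) & (xy <= bb[j, 1::2]).all(1)
print(ok)  # [ True False]
</code></pre>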
<p>Let's see what happens with more points and bounding boxes (possibly with overlaps):</p>
<pre class="lang-py prettyprint-override"><code>def gen_example(n, m, seed=None):
    # let's make a more complex problem, with overlaps
    if seed is not None:
        np.random.seed(seed)
    df1 = pd.DataFrame({
        'latitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
        'longitude': np.random.uniform(0, 1 + 1 / np.sqrt(m), size=n),
    })
    df2 = pd.DataFrame({
        'minlat': np.random.uniform(size=m),
        'maxlat': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'minlong': np.random.uniform(size=m),
        'maxlong': np.random.uniform(.5, 1, size=m) / np.sqrt(m),
        'STRING': [f'b_{i}' for i in range(m)],
    })
    df2['maxlat'] += df2['minlat']
    df2['maxlong'] += df2['minlong']
    return df1, df2
</code></pre>
<p>For <code>df1, df2 = gen_example(32, 12, 0)</code>, the corresponding pictures are:</p>
<img src="https://i.stack.imgur.com/fogok.png" width="400"/>
<img src="https://i.stack.imgur.com/a8Am8.png" width="400"/>
<p>The candidates (the query result, here with <code>p=2</code>) are:</p>
<pre class="lang-py prettyprint-override"><code>>>> kd_cand = kd_points.query_ball_tree(kd_boxes, r * 1.001, p=2)
>>> kd_cand
[[], [], [9], [], [], [10], [], [], [], [], [], [9], [10], [], [11], [11], [], [2, 6], [8], [8], [], [4], [7, 9], [4], [3, 5], [2, 4], [0], [], [7, 9], [7, 9], [0, 3, 5], [4]]
</code></pre>
<p>Since this is a jagged list, we convert it to a rectangular array padded with <code>fill_value=-1</code>:</p>
<pre class="lang-py prettyprint-override"><code>def pad_jagged(a, fill_value=-1):
    lens = np.array([len(r) for r in a])
    v = np.array([e for r in a for e in r])
    w = np.max(lens)
    return np.insert(
        v,
        np.repeat(np.cumsum(lens), w - lens),
        fill_value,
    ).reshape(-1, w)
</code></pre>
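<p>For example (repeating the helper so the snippet runs on its own), a small jagged input pads out on the right to the longest row's length:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

def pad_jagged(a, fill_value=-1):
    # same helper as above, repeated for a self-contained demo
    lens = np.array([len(r) for r in a])
    v = np.array([e for r in a for e in r])
    w = np.max(lens)
    return np.insert(v, np.repeat(np.cumsum(lens), w - lens),
                     fill_value).reshape(-1, w)

print(pad_jagged([[1], [2, 3], []]))
# [[ 1 -1]
#  [ 2  3]
#  [-1 -1]]
</code></pre>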
<p>For the candidate list above, this gives:</p>
<pre class="lang-py prettyprint-override"><code>>>> pad_jagged(kd_cand)
[[-1 -1 -1]
[ 8 -1 -1]
[ 9 -1 -1]
...
[ 7 9 -1]
[ 0 3 5]
[ 4 -1 -1]]
</code></pre>
<p>We now iterate over the columns of this array, greedily removing any successful match from the previous iterations (which is what <code>remaining</code> is for).</p>
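<p>The mechanics of that loop can be seen on a toy padded array, with a made-up accept list standing in for the actual containment test:</p>
<pre class="lang-py prettyprint-override"><code>import numpy as np

cand = np.array([[ 0, -1, -1],    # point 0: one candidate
                 [ 2,  3, -1],    # point 1: two candidates
                 [-1, -1, -1]])   # point 2: none
accept = np.array([0, 3])         # hypothetical boxes that "contain" their point
found = np.full(len(cand), -1)
remaining = np.arange(len(cand))
for c in range(cand.shape[1]):
    remaining = remaining[cand[remaining, c] >= 0]  # drop exhausted rows
    if remaining.size == 0:
        break
    j = cand[remaining, c]
    ok = np.isin(j, accept)
    found[remaining[ok]] = j[ok]  # record the first successful match
    remaining = remaining[~ok]    # matched rows leave the loop for good
print(found)  # [ 0  3 -1]
</code></pre>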
<p>The other functions, handling the preprocessing etc., are:</p>
<pre class="lang-py prettyprint-override"><code>def find_regions(xy, bb, p=2):
    # xy: numpy array (n, 2)
    # bb: numpy array (m, 4): the four columns are xmin, xmax, ymin, ymax
    # for each point in xy, return the index of a region that contains it, or m if none
    centers = bb @ np.array([[1,1,0,0], [0,0,1,1]]).T / 2
    assert p in {1, 2}, "only 1- or 2-norm supported"
    if p == 1:
        radius = bb @ np.array([-1,1,-1,1]) / 2
    else:
        radius = np.linalg.norm(bb @ np.array([[-1,1,0,0], [0,0,-1,1]]).T, axis=1) / 2
    kd_boxes = KDTree(centers)
    kd_points = KDTree(xy)
    return find(kd_boxes, kd_points, bb, radius.max(), p=p)

def find_region_label(df1, df2, nomatch=None, p=2):
    found = find_regions(
        df1[['longitude', 'latitude']].values,
        df2[['minlong', 'maxlong', 'minlat', 'maxlat']].values,
        p=p,
    )
    # append the nomatch label at index m, where "no match" points land
    lbl = np.r_[df2['STRING'].values, [nomatch]]
    return lbl[found]
</code></pre>
<p>On the OP's example:</p>
<pre><code>>>> df1.assign(LABEL=find_region_label(df1, df2))
latitude longitude LABEL
0 1.3 2.7 AAA
1 3.5 3.6 CCC
2 2.8 3.0 None
3 9.7 1.9 None
4 6.2 5.7 None
5 1.7 3.4 None
6 3.5 1.4 BBB
7 2.7 6.6 None
8 1.7 2.7 AAA
9 1.3 1.3 AAA
</code></pre>
<p><strong>Speed test</strong></p>
<p>Now for a speed test, where both the number of points <code>n</code> <em>and</em> the number of bounding boxes <code>m</code> are large:</p>
<pre class="lang-py prettyprint-override"><code>df1, df2 = gen_example(100_000, 10_000, 0)
</code></pre>
<p>Speed comparison with <a href="https://stackoverflow.com/a/68926522/758174">@norok2's numba solution</a> (slightly adapted to follow the OP's column names):</p>
<pre class="lang-py prettyprint-override"><code>a = %timeit -o find_region_label(df1, df2)
# 222 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
b = %timeit -o locate_in_regions_nb(df1, df2)
# 5.38 s ± 40.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> b.average / a.average
24.255
</code></pre>
<p><strong>That is a 24x speedup on this data.</strong></p>
<p>Verifying that we get the same solution:</p>
<pre class="lang-py prettyprint-override"><code>sa = find_region_label(df1, df2)
sb = locate_in_regions_nb(df1, df2)
>>> (sa == sb).all()
True
</code></pre>