<p>首先,我定义了一个提供一些样本数据的方法。如果你在问题中提供类似的东西,会容易得多。在大多数与性能相关的问题中,需要实际问题的大小来找到一个最优的解决方案。在</p>
<p>在下面的答案中,我将假定<code>id_easy</code>的平均大小为17,并且有30000个不同的id,这导致数据集大小为510u 000。在</p>
<p><strong>创建样本数据</strong></p>
<pre><code>import numpy as np
import numba as nb
N_ids=30_000
av_id_size=17
#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)
#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
for j in range(av_id_size):
ids[i*av_id_size+j]=ind
ind+=1
</code></pre>
<p><strong>Hausdorff函数</strong></p>
<p>下面的函数是来自scipy源代码的稍微修改的版本。
进行了以下修改:</p>
<ul>
<li>对于非常小的输入数组,我注释掉了洗牌部分(<strong>在更大的数组上启用</strong>洗牌,并在实际数据上尝试什么是最好的</li>
<li>至少在Windows上,Anaconda scipy函数看起来有一些性能问题(比Linux慢得多),基于LLVM的Numba看起来是一致的</li>
<li>移除Hausdorff对的指数</li>
<li><p>(N,2)情况下展开的距离环</p>
<pre><code>#Modified Code from Scipy-source
#https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
#Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016
#Copyright © 2001, 2002 Enthought, Inc.
#All rights reserved.
#Copyright © 2003-2013 SciPy Developers.
#All rights reserved.
#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
#Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
#Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
#disclaimer in the documentation and/or other materials provided with the distribution.
#Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived
#from this software without specific prior written permission.
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
#BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
#IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
#OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
#OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@nb.njit()
def directed_hausdorff_nb(ar1, ar2):
N1 = ar1.shape[0]
N2 = ar2.shape[0]
data_dims = ar1.shape[1]
# Shuffling for very small arrays disbabled
# Enable it for larger arrays
#resort1 = np.arange(N1)
#resort2 = np.arange(N2)
#np.random.shuffle(resort1)
#np.random.shuffle(resort2)
#ar1 = ar1[resort1]
#ar2 = ar2[resort2]
cmax = 0
for i in range(N1):
no_break_occurred = True
cmin = np.inf
for j in range(N2):
# faster performance with square of distance
# avoid sqrt until very end
# Simplificaten (loop unrolling) for (n,2) arrays
d = (ar1[i, 0] - ar2[j, 0])**2+(ar1[i, 1] - ar2[j, 1])**2
if d < cmax: # break out of `for j` loop
no_break_occurred = False
break
if d < cmin: # always true on first iteration of for-j loop
cmin = d
# always true on first iteration of for-j loop, after that only
# if d >= cmax
if cmin != np.inf and cmin > cmax and no_break_occurred == True:
cmax = cmin
return np.sqrt(cmax)
</code></pre></li>
</ul>
<p><strong>计算子集上的Hausdorff距离</strong></p>
<pre><code>@nb.njit(parallel=True)
def get_distance_mat(def_slice,lat_lon):
Num_ids=def_slice.shape[0]-1
out=np.empty((Num_ids,Num_ids),dtype=np.float64)
for i in nb.prange(Num_ids):
ar1=lat_lon[def_slice[i:i+1],:]
for j in range(i,Num_ids):
ar2=lat_lon[def_slice[j:j+1],:]
dist=directed_hausdorff_nb(ar1, ar2)
out[i,j]=dist
out[j,i]=dist
return out
</code></pre>
<p><strong>示例和计时</strong></p>
<pre><code>#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])
%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</code></pre>