大数据集Hausdorff距离的快速求解问题的回答

大数据集Hausdorff距离的快速求解

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

首先，我定义了一个提供一些样本数据的方法。如果你在问题中提供类似的东西，会容易得多。在大多数与性能相关的问题中，需要实际问题的大小来找到一个最优的解决方案。在 在下面的答案中，我将假定<code>id_easy</code>的平均大小为17，并且有30000个不同的id，这导致数据集大小为510u 000。在 创建样本数据 <pre><code>import numpy as np import numba as nb N_ids=30_000 av_id_size=17 #create_data (pre sorting according to id assumed) lat_lon=np.random.rand(N_ids*av_id_size,2) #create_ids (sorted array with ids) ids=np.empty(N_ids*av_id_size,dtype=np.int64) ind=0 for i in range(N_ids): for j in range(av_id_size): ids[i*av_id_size+j]=ind ind+=1 </code></pre> Hausdorff函数 下面的函数是来自scipy源代码的稍微修改的版本。进行了以下修改： <ul> <li>对于非常小的输入数组，我注释掉了洗牌部分（在更大的数组上启用洗牌，并在实际数据上尝试什么是最好的</li> <li>至少在Windows上，Anaconda scipy函数看起来有一些性能问题（比Linux慢得多），基于LLVM的Numba看起来是一致的</li> <li>移除Hausdorff对的指数</li> <li>（N，2）情况下展开的距离环 <pre><code>#Modified Code from Scipy-source #https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx #Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016 #Copyright © 2001, 2002 Enthought, Inc. #All rights reserved. #Copyright © 2003-2013 SciPy Developers. #All rights reserved. #Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: #Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. #Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following #disclaimer in the documentation and/or other materials provided with the distribution. #Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived #from this software without specific prior written permission. #THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, #BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. #IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, #OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; #OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT #(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. @nb.njit() def directed_hausdorff_nb(ar1, ar2): N1 = ar1.shape[0] N2 = ar2.shape[0] data_dims = ar1.shape[1] # Shuffling for very small arrays disbabled # Enable it for larger arrays #resort1 = np.arange(N1) #resort2 = np.arange(N2) #np.random.shuffle(resort1) #np.random.shuffle(resort2) #ar1 = ar1[resort1] #ar2 = ar2[resort2] cmax = 0 for i in range(N1): no_break_occurred = True cmin = np.inf for j in range(N2): # faster performance with square of distance # avoid sqrt until very end # Simplificaten (loop unrolling) for (n,2) arrays d = (ar1[i, 0] - ar2[j, 0])**2+(ar1[i, 1] - ar2[j, 1])**2 if d < cmax: # break out of `for j` loop no_break_occurred = False break if d < cmin: # always true on first iteration of for-j loop cmin = d # always true on first iteration of for-j loop, after that only # if d >= cmax if cmin != np.inf and cmin > cmax and no_break_occurred == True: cmax = cmin return np.sqrt(cmax) </code></pre></li> </ul> 计算子集上的Hausdorff距离 <pre><code>@nb.njit(parallel=True) def get_distance_mat(def_slice,lat_lon): Num_ids=def_slice.shape[0]-1 out=np.empty((Num_ids,Num_ids),dtype=np.float64) for i in nb.prange(Num_ids): ar1=lat_lon[def_slice[i:i+1],:] for j in range(i,Num_ids): ar2=lat_lon[def_slice[j:j+1],:] dist=directed_hausdorff_nb(ar1, ar2) out[i,j]=dist out[j,i]=dist return out </code></pre> 示例和计时 <pre><code>#def_slice defines the start and end of the slices _,def_slice=np.unique(ids,return_index=True) def_slice=np.append(def_slice,ids.shape[0]) %timeit res_1=get_distance_mat(def_slice,lat_lon) #1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) </code></pre>

大数据集Hausdorff距离的快速求解

1 个回答

相关Python问题