<p>The brute-force approach to finding the nearest of <code>N</code> points to a given point is <code>O(N)</code> -- you have to check every point.
In contrast, if the <code>N</code> points are stored in a <a href="https://en.wikipedia.org/wiki/K-d_tree" rel="noreferrer">KD-tree</a>, then finding the nearest point takes <code>O(log(N))</code> time on average.
There is also the additional one-time cost of building the KD-tree, which requires <code>O(N log(N))</code> time.</p>
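<p>As a minimal sketch of that trade-off (the random points here are purely illustrative): the tree is built once, after which each query descends the tree instead of scanning all <code>N</code> points.</p>

```python
import numpy as np
from scipy import spatial

rng = np.random.RandomState(0)
pts = rng.rand(1000, 3)              # N = 1000 random points in 3-D

tree = spatial.KDTree(pts)           # one-time build cost

# a single nearest-neighbor query is O(log(N)) on average
dist, idx = tree.query([0.5, 0.5, 0.5], k=1)

# brute force for comparison: scan all N points
brute_idx = np.argmin(np.linalg.norm(pts - [0.5, 0.5, 0.5], axis=1))
assert idx == brute_idx
```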
<p>If you need to repeat this process <code>N</code> times, the brute-force method is <code>O(N**2)</code> while the KD-tree method is <code>O(N*log(N))</code>.
So for sufficiently large <code>N</code>, the KD-tree beats the brute-force approach.</p>
<p>For more on nearest-neighbor algorithms (including KD-trees), see <a href="http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms" rel="noreferrer">here</a>.</p>
<hr/>
<p>Below (in the function <code>using_kdtree</code>) is a way to compute the great-circle arc length to each point's nearest neighbor using <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html" rel="noreferrer"><code>scipy.spatial.KDTree</code></a>.</p>
<p><code>scipy.spatial.KDTree</code> measures the Euclidean distance between points, but there is a <a href="https://en.wikipedia.org/wiki/Great-circle_distance#From_chord_length" rel="noreferrer">formula</a> for converting the Euclidean chord distance between points on a sphere into the great-circle arc length (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into Cartesian coordinates, use a <code>KDTree</code> to find the nearest neighbors, and then apply the <a href="https://en.wikipedia.org/wiki/Great-circle_distance#From_chord_length" rel="noreferrer">great circle distance formula</a> to obtain the desired result.</p>
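<p>The chord-to-arc conversion itself is short: for a chord of length <code>d</code> on a sphere of radius <code>R</code>, the central angle is <code>2*arcsin(d/(2R))</code> and the arc length is <code>R</code> times that angle. A self-contained sketch of both steps (the value <code>R = 6371</code> km here is purely illustrative):</p>

```python
import numpy as np

R = 6371.0  # illustrative sphere radius in km

def latlon_to_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to 3-D Cartesian coordinates."""
    phi, theta = np.deg2rad(lat_deg), np.deg2rad(lon_deg)
    return np.array([R * np.cos(phi) * np.cos(theta),
                     R * np.cos(phi) * np.sin(theta),
                     R * np.sin(phi)])

def chord_to_arc(chord):
    """Convert a Euclidean chord length to a great-circle arc length."""
    return R * 2 * np.arcsin(chord / (2.0 * R))

# two points one degree of longitude apart on the equator
p1, p2 = latlon_to_xyz(0, 0), latlon_to_xyz(0, 1)
arc = chord_to_arc(np.linalg.norm(p1 - p2))
# arc agrees exactly with R * (1 degree in radians)
assert np.isclose(arc, R * np.deg2rad(1))
```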
<hr/>
<p>Here are some benchmarks. With <code>N = 100</code>, <code>using_kdtree</code> is 39 times faster than the <code>orig</code> (brute-force) method:</p>
<pre><code>In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop
</code></pre>
<p>For <code>N = 10000</code>:</p>
<pre><code>In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
</code></pre>
<p>Since <code>using_kdtree</code> is <code>O(N log(N))</code> and <code>orig</code> is <code>O(N**2)</code>, the factor by which <code>using_kdtree</code> is faster than <code>orig</code> will grow as <code>N</code>, the length of <code>data</code>, grows.</p>
<hr/>
<pre><code>import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y', 'z']])
    distance, index = tree.query(data[['x', 'y', 'z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a))
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance

def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km

    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude': np.random.uniform(-90, 90, size=N),
                     'Longitude': np.random.uniform(0, 360, size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)
</code></pre>
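<p>As an aside, much of <code>using_sklearn</code>'s slowness comes from calling a Python <code>distance</code> function once per pair. scikit-learn's <code>BallTree</code> supports a built-in <code>haversine</code> metric (it expects <code>[lat, lon]</code> in radians and returns distances in radians), which keeps the metric evaluation in compiled code. A sketch of that variant, as an alternative to the Python-callable metric above:</p>

```python
import numpy as np
from sklearn.neighbors import BallTree

R = 6367  # same Earth radius as above

rng = np.random.RandomState(2017)
lat = np.deg2rad(rng.uniform(-90, 90, size=100))
lon = np.deg2rad(rng.uniform(0, 360, size=100))
X = np.column_stack([lat, lon])   # the haversine metric wants [lat, lon] in radians

tree = BallTree(X, metric='haversine')
dist, idx = tree.query(X, k=2)    # dist[:, 0] is each point's distance to itself (0)
nearest_km = dist[:, 1] * R       # convert central angle (radians) to km
```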