Faster outer-product-style application of a function in pandas to find the nearest lat/long

Posted 2024-09-28 05:17:36


I have a pandas DataFrame A with latitude and longitude columns:

import pandas as pd

df_a = pd.DataFrame([['b',1.591797,103.857887],
                 ['c',1.589416, 103.865322]],
                columns = ['place','lat','lng'])

I have another DataFrame B of reference locations, also with latitude and longitude:

df_b = pd.DataFrame([['ref1',1.594832, 103.853703],
                 ['ref2',1.589749, 103.864678]],
                columns = ['place','lat','lng'])

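For each row in A, I want to find the closest matching row in B (subject to a distance limit). Note: I already have a function that computes the distance between two pairs of GPS coordinates.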

Expected output:

# a list where each row is the corresponding closest index in B
In [13]: min_index_arr
Out[13]: [0, 1]

One approach is:

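from math import radians, sin, cos, asin, sqrt

def haversine(pair1, pair2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # each pair is passed as (lat, lng), matching the calling loop below
    lat1, lon1 = pair1
    lat2, lon2 = pair2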
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

import operator
min_vals = []
for i in df_a.index:
    pair1 = df_a['lat'][i], df_a['lng'][i]
    dist_array = []
    for j in df_b.index:
        pair2 = df_b['lat'][j], df_b['lng'][j]
        dist = haversine(pair1, pair2)
        dist_array.append(dist)
    min_index, min_value = min(enumerate(dist_array), key=operator.itemgetter(1))
    min_vals.append(min_index)

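But I'm sure there is a faster way to do this. It looks very much like an outer product, except that a function is applied to each pair instead of a multiplication. Does anyone know how to do it?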


1 Answer
User
#1 · Posted 2024-09-28 05:17:36

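This uses the approach from "KDTree for longitude/latitude".

It is based on scikit-learn's BallTree (sklearn.neighbors.BallTree) with the haversine metric, which expects coordinates in radians and returns distances in radians.

Code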

import numpy as np
from sklearn.neighbors import BallTree

# Setup Balltree using df_b as reference dataset
bt = BallTree(np.deg2rad(df_b[['lat', 'lng']].values), metric='haversine')

# Setup distance queries
query_lats = df_a['lat']
query_lons = df_a['lng']

# Find closest city in reference dataset for each city in a
distances, indices = bt.query(np.deg2rad(np.c_[query_lats, query_lons]))

# Result
r_km = 6371
for p, d, i in zip(df_a['place'][:], distances.flatten(), indices.flatten()):
  print(f"Place {p} closest to {df_b['place'][i]} with distance {d*r_km:.4f} km")

Output

Place b closest to ref1 with distance 0.5746 km
Place c closest to ref2 with distance 0.0806 km
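
For small inputs, the "outer product with a function" idea from the question can also be sketched directly with NumPy broadcasting. This is only an illustration, not part of the BallTree approach above: the names pairwise_haversine_km, max_km and limited_index_arr are hypothetical, and the 0.5 km cutoff is an assumed value standing in for the distance limit mentioned in the question.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def pairwise_haversine_km(a_deg, b_deg):
    """Return a (len(a), len(b)) matrix of great-circle distances in km.

    a_deg and b_deg are arrays of shape (n, 2) and (m, 2) holding
    (lat, lng) pairs in decimal degrees.
    """
    a = np.radians(np.asarray(a_deg, dtype=float))
    b = np.radians(np.asarray(b_deg, dtype=float))
    lat_a = a[:, 0][:, None]  # shape (n, 1)
    lng_a = a[:, 1][:, None]
    lat_b = b[:, 0][None, :]  # shape (1, m)
    lng_b = b[:, 1][None, :]
    dlat = lat_b - lat_a      # broadcasts to shape (n, m)
    dlng = lng_b - lng_a
    h = np.sin(dlat / 2) ** 2 + np.cos(lat_a) * np.cos(lat_b) * np.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

# Same coordinates as df_a and df_b; with the DataFrames at hand,
# df_a[['lat', 'lng']].to_numpy() would give the same arrays.
a_coords = np.array([[1.591797, 103.857887], [1.589416, 103.865322]])
b_coords = np.array([[1.594832, 103.853703], [1.589749, 103.864678]])

dist_km = pairwise_haversine_km(a_coords, b_coords)
min_index_arr = dist_km.argmin(axis=1)  # closest row of B for each row of A
min_dist_km = dist_km[np.arange(len(a_coords)), min_index_arr]

# Assumed cutoff: matches farther away than max_km are marked -1 (no match).
max_km = 0.5
limited_index_arr = np.where(min_dist_km <= max_km, min_index_arr, -1)

print(min_index_arr.tolist())
print(limited_index_arr.tolist())
```

This builds the full n x m distance matrix, so memory grows with the product of the two table sizes. For large reference sets the BallTree query is likely the better choice.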
