Python: grouping by time and space

Published 2024-10-03 02:40:07


    ID  timestamp                lat          lon
0   A   2020-03-20 00:17:10 42.360000   -71.090000
1   A   2020-03-20 00:20:51 42.360000   -71.090000
2   A   2020-03-20 00:35:31 42.360000   -71.090000
3   A   2020-03-20 00:35:34 42.360000   -71.090000
4   B   2020-03-20 01:48:14 42.360000   -71.100000
5   B   2020-03-20 03:15:00 42.360000   -71.100000
6   C   2020-03-20 11:05:47 42.365259   -71.103502
7   D   2020-03-20 10:53:43 42.363174   -71.096756
8   D   2020-03-20 10:57:45 42.363260   -71.095598
9   D   2020-03-20 11:04:24 42.363303   -71.094997

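I want to find out whether two users were within a 100-metre radius of each other for at least 10 seconds at any time during the day. I would like output like the following: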

df
      userI      userJ     centroid.lat  centroid.lon     time
0      A          B         42.360000      -71.094997      33s
1      B          D         42.365259      -71.103502      5s

2 answers

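I don't know what you have tried, but you could start like this. I did not take the 10-second timing into account, but it is easy to add. I use geopy.distance.distance to measure the distance. The code below stores the encounters in a list, from which you can easily build a new DataFrame.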

from itertools import combinations

import geopy.distance

# This code assumes df has a two-level MultiIndex of (row number, ID)
# and the columns "date", "lat" and "lon".

# threshold distance in km
threshold_distance = 0.1

# list of IDs
id_list = list(df.index.levels[1])

# list to store the row-number pairs of the meetings
meetings = []

# go through all pairs of IDs
for i, j in combinations(id_list, 2):

    # get the row numbers that belong to each ID
    row_numbers = df.index.get_level_values(0)
    ids = df.index.get_level_values(1)
    i_indices = row_numbers[ids == i]
    j_indices = row_numbers[ids == j]

    # go through the two IDs' data
    for i_index in i_indices:
        for j_index in j_indices:
            # skip pairs of rows whose dates do not coincide
            if df.at[(i_index, i), "date"] != df.at[(j_index, j), "date"]:
                continue

            # use geopy to calculate the distance from the coordinates
            coords1 = (df.at[(i_index, i), "lat"], df.at[(i_index, i), "lon"])
            coords2 = (df.at[(j_index, j), "lat"], df.at[(j_index, j), "lon"])
            if geopy.distance.distance(coords1, coords2).km < threshold_distance:
                meetings.append((i_index, j_index))
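The answer above leaves out the 10-second condition. One way it might be added is sketched below using only the standard library. This is a minimal sketch under stated assumptions, not the answerer's code: the haversine formula (a spherical approximation) is swapped in for geopy.distance.distance, the time condition is read as "two pings at most 10 seconds apart", and the names haversine_m, find_meetings and ts are hypothetical. The E and F rows are borrowed from the extended sample further down the page so that there is something to find.

```python
from datetime import datetime, timedelta
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def find_meetings(rows, max_dist_m=100.0, max_gap=timedelta(seconds=10)):
    """Return (id_i, id_j, time_i, time_j) for pings of different IDs that are
    at most max_dist_m metres and at most max_gap apart."""
    meetings = []
    for (id_i, t_i, lat_i, lon_i), (id_j, t_j, lat_j, lon_j) in combinations(rows, 2):
        if id_i == id_j:
            continue  # same user, not an encounter
        if abs(t_i - t_j) > max_gap:
            continue  # too far apart in time
        if haversine_m(lat_i, lon_i, lat_j, lon_j) <= max_dist_m:
            meetings.append((id_i, id_j, t_i, t_j))
    return meetings


def ts(text):
    """Parse a timestamp in the format used by the sample data."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


rows = [
    ("A", ts("2020-03-20 00:35:31"), 42.360000, -71.090000),
    ("A", ts("2020-03-20 00:35:34"), 42.360000, -71.090000),
    ("B", ts("2020-03-20 01:48:14"), 42.360000, -71.100000),
    ("E", ts("2020-03-20 00:35:33"), 42.360001, -71.090001),
    ("F", ts("2020-03-20 01:48:15"), 42.360003, -71.100099),
]
meetings = find_meetings(rows)
for meeting in meetings:
    print(meeting)
```

Like the loop above, this compares every pair of rows, so it is quadratic in the number of pings and only suitable for small data sets.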

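For this example I added a few extra rows so that the data gives more useful results. Note that this approach would need a lot of optimisation and error handling to scale well.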

   ID  timestamp                lat          lon
0   A   2020-03-20 00:17:10 42.360000   -71.090000
1   A   2020-03-20 00:20:51 42.360000   -71.090000
2   A   2020-03-20 00:35:31 42.360000   -71.090000
3   A   2020-03-20 00:35:34 42.360000   -71.090000
4   B   2020-03-20 01:48:14 42.360000   -71.100000
5   B   2020-03-20 03:15:00 42.360000   -71.100000
6   C   2020-03-20 11:05:47 42.365259   -71.103502
7   D   2020-03-20 10:53:43 42.363174   -71.096756
8   D   2020-03-20 10:57:45 42.363260   -71.095598
9   D   2020-03-20 11:04:24 42.363303   -71.094997
10  E   2020-03-20 00:35:33 42.360001   -71.090001
11  F   2020-03-20 01:48:15 42.360003   -71.100099

Next we need to tidy up df a little.

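import pandas as pd
import datetime
import numpy as np
from scipy import spatial

df = pd.read_clipboard(sep=r"[ ]{2,}")
# The lines below assume the pasted columns come out shifted: the ID lands
# in 'timestamp', and the timestamp and latitude share the 'lat' column.
df['lat_fix'] = df['lat'].str[-10:]
df['time'] = df['lat'].str[0:19]
df['ID'] = df['timestamp']
df['lat'] = df['lat_fix']
# .copy() avoids SettingWithCopyWarning on the assignments that follow
df = df[['ID', 'time', 'lat', 'lon']].copy()
df['lat'] = pd.to_numeric(df['lat'])
df['time'] = pd.to_datetime(df['time'])
# idx keeps a positional row number once 'time' becomes the index
df['idx'] = range(0, df.shape[0])
df.set_index('time', inplace=True)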

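Then we find the points that lie within the distance threshold. distance_thresh_list stores a list of lists, where each sublist holds the idx values of one group of points that are less than ~100 m apart.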

# cKDTree treats the coordinates as planar, so the radius is in decimal degrees
points = df[['lon', 'lat']].to_numpy()
tree = spatial.cKDTree(points)

distance_thresh_list = []
for p in points:
    # 0.0009 decimal degrees is very close to 100 m (measured along latitude)
    neighbours = tree.query_ball_point(p, 0.0009)
    if len(neighbours) > 1 and neighbours not in distance_thresh_list:
        distance_thresh_list.append(neighbours)
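The 0.0009 figure can be checked with a little arithmetic. The sketch below assumes a spherical Earth with a mean radius of 6,371 km, and the helper name degrees_to_metres is hypothetical. It suggests that 0.0009 degrees is about 100 m north-south, but only about 74 m east-west at the latitude of the sample data, because meridians converge away from the equator.

```python
from math import cos, pi, radians

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def degrees_to_metres(delta_deg, latitude_deg):
    """Return (north_south_m, east_west_m) spanned by delta_deg at a latitude."""
    metres_per_degree = 2 * pi * EARTH_RADIUS_M / 360
    north_south = delta_deg * metres_per_degree
    # a degree of longitude shrinks by cos(latitude)
    east_west = north_south * cos(radians(latitude_deg))
    return north_south, east_west


ns_m, ew_m = degrees_to_metres(0.0009, 42.36)
print(f"{ns_m:.1f} m north-south, {ew_m:.1f} m east-west")
```

If an exact 100 m radius matters, one option is to project the coordinates to a metric coordinate system before building the tree, or to re-check candidate pairs with a geodesic distance.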

Then we look for groups that contain more than one unique ID.

spatial_matches_list = []
df_spatial_match_list = []

for i in distance_thresh_list:
    df_slice = df[df['idx'].isin(i)]
    uniq_id_list = df_slice.ID.unique().tolist()
    # keep only groups with at least two different IDs, and each ID set once
    if len(uniq_id_list) > 1 and uniq_id_list not in spatial_matches_list:
        print(uniq_id_list)
        spatial_matches_list.append(uniq_id_list)

        # the rows of this spatial group
        df_spatial_match = df[df['idx'].isin(i)]
        print(i)
        print(df_spatial_match)

        df_spatial_match_list.append(df_spatial_match)

Finally, we look for matches in time.

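# collect the time-matched slices so they can be reused later
df_spatial_match_slice_list = []

for match_df in df_spatial_match_list:
    for idx, row in match_df.iterrows():
        # window of 10 seconds either side of this row's timestamp
        before_window = idx + datetime.timedelta(seconds=-10)
        after_window = idx + datetime.timedelta(seconds=10)

        df_spatial_match_slice = match_df[(match_df.index >= before_window) & (match_df.index <= after_window)]

        # a match needs at least two different IDs inside the window
        if len(df_spatial_match_slice['ID'].unique().tolist()) > 1:
            print(df_spatial_match_slice)
            df_spatial_match_slice_list.append(df_spatial_match_slice)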

Here are the matches (with duplicates):

                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 00:35:31  A  42.360000 -71.090000    2
2020-03-20 00:35:34  A  42.360000 -71.090000    3
2020-03-20 00:35:33  E  42.360001 -71.090001   10
                    ID        lat        lon  idx
time                                             
2020-03-20 01:48:14  B  42.360000 -71.100000    4
2020-03-20 01:48:15  F  42.360003 -71.100099   11
                    ID        lat        lon  idx
time                                             
2020-03-20 01:48:14  B  42.360000 -71.100000    4
2020-03-20 01:48:15  F  42.360003 -71.100099   11
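The duplicates appear because each row in a group triggers its own window and prints the same slice again. One possible way to drop them is sketched below in plain Python. It assumes each match has first been reduced to a list of (ID, timestamp) tuples; that representation and the name unique_matches are hypothetical and not part of the answer.

```python
def unique_matches(matches):
    """Drop repeated matches while keeping first-seen order.

    Each match is a list of hashable rows; row order inside a match is ignored.
    """
    seen = set()
    unique = []
    for match in matches:
        key = frozenset(match)
        if key not in seen:
            seen.add(key)
            unique.append(match)
    return unique


a_e = [("A", "2020-03-20 00:35:31"), ("A", "2020-03-20 00:35:34"), ("E", "2020-03-20 00:35:33")]
b_f = [("B", "2020-03-20 01:48:14"), ("F", "2020-03-20 01:48:15")]

# the same pattern as the printed output: three copies of A/E, two of B/F
matches = [a_e, a_e, a_e, b_f, b_f]
deduped = unique_matches(matches)
print(len(deduped), "unique matches")
```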

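So the code above only checks whether IDs were close to each other within a time window. If we want to work out how long the IDs were close to each other, we can do something like this: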

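id_min_max_dict = {}

# df_spatial_match_slice_list: the time-matched slices found in the previous step
for i in df_spatial_match_slice_list:
    for j in i.ID.unique().tolist():

        # first and last timestamp at which this ID appears in the slice
        id_slice = i.loc[i['ID'] == j]
        id_slice_time_max = id_slice.index.max()
        id_slice_time_min = id_slice.index.min()

        # keyed by ID only, so a later slice overwrites an earlier one
        id_min_max_dict[j] = [id_slice_time_min, id_slice_time_max]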

Once we have a dict that stores the time ranges, we can see how many seconds the IDs at the same location share.

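for i in spatial_matches_list:
    print(i)
    # one timestamp per second between each ID's first and last sighting
    # (lowercase 's'; the uppercase 'S' alias is deprecated in recent pandas)
    time_range1 = pd.date_range(id_min_max_dict[i[0]][0], id_min_max_dict[i[0]][1], freq='s')
    time_range2 = pd.date_range(id_min_max_dict[i[1]][0], id_min_max_dict[i[1]][1], freq='s')

    # only the first two IDs of each group are compared
    time_range_intersection = time_range1.intersection(time_range2)
    print(time_range_intersection)
    print(str(len(time_range_intersection)) + ' seconds of time within ~100m')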

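So the time/location intersections look like this. FWIW, this is not very exciting without more rows of sample data, and the approach needs extra work before it can handle groups with more than two unique IDs.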

['A', 'E']
DatetimeIndex(['2020-03-20 00:35:33'], dtype='datetime64[ns]', freq=None)
1 seconds of time within ~100m
['B', 'F']
DatetimeIndex([], dtype='datetime64[ns]', freq=None)
0 seconds of time within ~100m
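One way the overlap step might be extended to more than two IDs is sketched below, using only the standard library. It assumes each ID has already been reduced to a (first_seen, last_seen) span at one location, as in id_min_max_dict. The names overlap_seconds and pairwise_overlaps and the X, Y, Z spans are hypothetical illustration data, not taken from the sample. Note that it measures the length of the shared interval, whereas the date_range intersection above counts the second-stamps the two ranges have in common, so the two can differ by one.

```python
from datetime import datetime, timedelta
from itertools import combinations


def overlap_seconds(span_a, span_b):
    """Length in seconds of the intersection of two (start, end) spans."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    return max((end - start).total_seconds(), 0.0)


def pairwise_overlaps(spans, min_seconds=10):
    """Map each pair of IDs to its shared seconds, keeping pairs that reach min_seconds."""
    result = {}
    for id_i, id_j in combinations(sorted(spans), 2):
        seconds = overlap_seconds(spans[id_i], spans[id_j])
        if seconds >= min_seconds:
            result[(id_i, id_j)] = seconds
    return result


# hypothetical spans for three IDs seen at the same location
t0 = datetime(2020, 3, 20, 0, 35, 0)
spans = {
    "X": (t0, t0 + timedelta(seconds=40)),
    "Y": (t0 + timedelta(seconds=10), t0 + timedelta(seconds=60)),
    "Z": (t0 + timedelta(seconds=35), t0 + timedelta(seconds=90)),
}
overlaps = pairwise_overlaps(spans)
print(overlaps)
```

Here X and Z share only 5 seconds, so that pair is dropped by the 10-second threshold.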
