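I have a dataset of 87,288 points that I need to filter. The fields to filter on are the X and Y positions, that is, longitude and latitude. The plotted data looks like this:
The problem is that I only need the data along a specific path, which is known in advance. Something like this:
I already know how to filter rows in the dataset, but because the path is not linear I need an efficient strategy to remove all the noisy data with a given precision (the dataset is too large for picking points by hand to be an option).
Here is some sample data. The only columns that matter are Latitud and Longitud, which are Y and X respectively.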
Sesion,Tiempo,Latitud,Longitud,PM2.5,Modo,Hora,DiaSemana
M-O-AM-07OCT19-DMR,2019-10-01 09:48:17.625,3.3659550000000005,-76.5288288,13.0,OUTDOOR,AM,1
M-O-AM-07OCT19-DMR,2019-10-07 08:18:03.555,3.3661757000000003,-76.5289441,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:04.596,3.3661757000000003,-76.5289441,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:05.572,3.3661767,-76.5289375,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:06.614,3.3661790999999996,-76.5289188,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:07.581,3.3661814,-76.5289024,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:08.588,3.3661847999999996,-76.52889820000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:09.570,3.3661922,-76.52890450000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:10.579,3.3661922,-76.52890450000001,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:11.577,3.3662135,-76.52893370000001,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:12.611,3.3662227999999996,-76.5289516,12.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:13.561,3.3662227999999996,-76.5289516,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:14.631,3.3662346,-76.5289927,11.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:15.554,3.3662421,-76.52901440000001,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:16.623,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:17.593,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:18.617,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:19.608,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:20.605,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:21.594,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:22.608,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:23.620,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:24.611,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:25.622,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:26.590,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:27.619,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:28.595,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:29.628,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
M-O-AM-07OCT19-DMR,2019-10-07 08:18:30.621,3.3662523999999996,-76.5290363,10.0,OUTDOOR,AM,0
I tried hand-picking a few points along the route and then filtering the remaining points by a fixed minimum distance, something like this:
import pandas as pd
import geopy.distance

def get_dist(coords_1, coords_2):
    # Geodesic distance in metres between two (latitude, longitude) pairs
    return geopy.distance.distance(coords_1, coords_2).meters
dists=[
(-76.5297163,3.3665631),
(-76.5307019,3.3656924),
(-76.5314718,3.3646900),
(-76.5319956,3.3638394),
(-76.5316622,3.3621781),
(-76.5311999,3.3611796),
(-76.5308636,3.3599338),
(-76.5306335,3.3585191),
(-76.5304758,3.3577502),
(-76.5303957,3.3561101),
(-76.5302998,3.3543178),
(-76.5302220,3.3531897),
(-76.5302369,3.3515283),
(-76.5303363,3.3502667),
(-76.5305351,3.3485951),
(-76.5306779,3.3475220),
(-76.5308545,3.3456382),
(-76.5307738,3.3446934),
(-76.530618,3.3430422)
]
df = pd.read_csv('movil.csv')
for index, row in df.iterrows():
    if index % 1000 == 0:
        print(index)  # progress indicator
    # Smallest distance from this row to any hand-picked waypoint
    mind = None
    for i in dists:
        # Waypoints are stored as (lon, lat); geopy expects (lat, lon)
        d = get_dist((row['Latitud'], row['Longitud']), (i[1], i[0]))
        if mind is None or d < mind:
            mind = d
    # Drop rows that are more than 125 m from every waypoint
    if mind > 125:
        df.drop(index, inplace=True)
print(df)
With this approach I managed to get some cleaning done, but I feel that a lot of useful data is being filtered out.
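Let's start with some sample data. Note that latitude and longitude are stored in degrees for generation and plotting, but in radians for the calculations.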
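The answer's original sample data is not preserved here, so the following is only a minimal sketch of what such data could look like. The frame names `ref` and `traj`, the column names, the noise level, and the random seed are all assumptions; the waypoints are the first five hand-picked points from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility

# Reference path: the first five waypoints from the question, in degrees.
ref = pd.DataFrame({
    "lat": [3.3665631, 3.3656924, 3.3646900, 3.3638394, 3.3621781],
    "lon": [-76.5297163, -76.5307019, -76.5314718, -76.5319956, -76.5316622],
})

# Hypothetical recorded trajectory: 200 points scattered around the waypoints.
idx = rng.integers(0, len(ref), size=200)
traj = pd.DataFrame({
    "lat": ref["lat"].to_numpy()[idx] + rng.normal(0, 5e-4, size=200),
    "lon": ref["lon"].to_numpy()[idx] + rng.normal(0, 5e-4, size=200),
})

# Degrees are kept for plotting; radian columns are added for the math.
for frame in (ref, traj):
    frame["lat_rad"] = np.radians(frame["lat"])
    frame["lon_rad"] = np.radians(frame["lon"])
```

With the real data, `traj` would instead come from `pd.read_csv('movil.csv')`, using the `Latitud` and `Longitud` columns.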
Next, we can define a vectorized function that returns the distance between two points. It should work with 1-D or 2-D arrays.
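One possible form of such a function is sketched below. It uses the haversine formula, which is an assumption: the answer's original function is not preserved and may have used a different distance. The name `haversine` and the constant `EARTH_RADIUS_M` are illustrative.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres.

    All inputs are in radians and may be scalars or NumPy arrays of any
    mutually broadcastable shapes.
    """
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    # Clip guards against rounding pushing `a` marginally outside [0, 1].
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
```

Because it only uses NumPy ufuncs, the same function accepts a pair of points, two equal-length arrays, or a column array against a row array.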
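Then we can find the minimum distance from each trajectory point to any of the reference trajectory points. This is O(N*M), which is computationally expensive, but we can vectorize it by broadcasting the reference points and the trajectory points into 2-D arrays. We can then choose a tolerance and flag the points whose minimum distance is less than that tolerance.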
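The broadcasting step could look roughly like the sketch below. The haversine helper, the three test points, and the variable names are assumptions made for illustration; the 125 m tolerance is the threshold used in the question's code.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; inputs in radians, broadcastable."""
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

# M reference waypoints (first three from the question), converted to radians.
ref_lat = np.radians([3.3665631, 3.3656924, 3.3646900])
ref_lon = np.radians([-76.5297163, -76.5307019, -76.5314718])

# N hypothetical trajectory points: on a waypoint, about 1.1 km away,
# and about 60 m away.
traj_lat = np.radians([3.3665631, 3.3765631, 3.3660000])
traj_lon = np.radians([-76.5297163, -76.5297163, -76.5297163])

# (N, 1) against (1, M) broadcasts to an (N, M) matrix of distances.
dist = haversine(traj_lat[:, None], traj_lon[:, None],
                 ref_lat[None, :], ref_lon[None, :])

min_dist = dist.min(axis=1)        # nearest waypoint for each trajectory point
tolerance_m = 125.0                # threshold taken from the question's code
near_ref = min_dist < tolerance_m  # boolean mask, one entry per trajectory point
```

For the question's 87,288 points and 19 waypoints the intermediate matrix holds about 1.7 million distances, which fits comfortably in memory; with a much denser reference path the trajectory could be processed in chunks instead.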
Finally, we can use the boolean near_ref mask to filter the traj dataframe.
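As a small illustration of that last step, the sketch below applies a hand-written mask to a three-row frame. The rows are shortened from the question's sample data and the mask values are made up; in practice `near_ref` would come from the minimum-distance comparison.

```python
import numpy as np
import pandas as pd

# Three rows shaped like the question's data (coordinates rounded, illustrative).
traj = pd.DataFrame({
    "Latitud": [3.3661757, 3.3765631, 3.3662135],
    "Longitud": [-76.5289441, -76.5297163, -76.5289337],
    "PM2.5": [12.0, 11.0, 12.0],
})

# Hypothetical mask: True where the point lies within tolerance of the path.
near_ref = np.array([True, False, True])

# Boolean indexing keeps only the rows flagged True and preserves their index.
filtered = traj[near_ref]
print(filtered)
```

Unlike calling `df.drop` inside a loop, this selects all matching rows in a single operation and leaves the original frame untouched.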