How to filter out data points whose timestamps are very close together

Posted 2024-05-20 04:08:11


I have a DataFrame df:

          Id    timestamp               data    sig     events1 Start   Peak    Timediff    Datadiff
104513  104754  2012-03-21 16:23:21.323 19.5    1.0     0.0     1.0     0.0     28732.920   0.5
104514  104755  2012-03-21 16:23:21.423 20.0    -1.0    0.0     0.0     1.0     0.100       0.5
104623  104864  2012-03-22 04:27:04.550 19.5    0.0     0.0     0.0     0.0     43423.127   -0.5
104630  104871  2012-03-22 04:27:11.670 19.5    -1.0    0.0     0.0     1.0     7.120       0.0
105147  105388  2012-03-23 06:12:24.523 19.0    -1.0    0.0     0.0     0.0     92712.853   -0.5
105148  105389  2012-03-23 06:12:24.623 18.5    1.0     1.0     0.0     0.0     0.100       -0.5

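For rows where Timediff == 0.100 and the timestamp shares the same hour, minute, and second as the previous row (for example, 16:23:21.323 and 16:23:21.423), I want to drop the first row of the pair, i.e. 16:23:21.323 (the previous row). How can I do this?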

Expected result:

          Id    timestamp               data    sig     events1 Start   Peak    Timediff    Datadiff

104514  104755  2012-03-21 16:23:21.423 20.0    -1.0    0.0     0.0     1.0     0.100       0.5
104623  104864  2012-03-22 04:27:04.550 19.5    0.0     0.0     0.0     0.0     43423.127   -0.5
104630  104871  2012-03-22 04:27:11.670 19.5    -1.0    0.0     0.0     1.0     7.120       0.0
105148  105389  2012-03-23 06:12:24.623 18.5    1.0     1.0     0.0     0.0     0.100       -0.5

3 answers

There may be a simpler way to do this, but I believe the following works for the conditions you describe:

# reset index to make our life easier
df = df.reset_index()

# get rows with small timediffs
small_timediffs = df.loc[df.Timediff <= 0.1]
# get rows immediately above rows with small timediffs
prev_small_timediffs = df.loc[small_timediffs.index-1]

# helper function to allow comparison of timestamps without milliseconds
# (str() lets it work whether the column holds strings or Timestamps)
remove_milliseconds = lambda x: str(x).split('.')[0]

# compare if timestamps from rows with small timediffs
# and rows above them are the same
stamps = small_timediffs.timestamp.apply(remove_milliseconds).values
prev_stamps = prev_small_timediffs.timestamp.apply(remove_milliseconds).values
cond = (stamps == prev_stamps)

# extract results based on both conditions
# (small timediff and same timestamp)
result = df.loc[~df.index.isin(small_timediffs.loc[cond].index-1)]
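For reference, here is a minimal self-contained sketch of the same idea run on the rows from the question. It is not the code above verbatim: it assumes `timestamp` is stored as strings, keeps only three of the columns, and uses `shift` instead of `index-1` so that it does not depend on the index being consecutive. The names `seconds`, `is_second_of_pair` and `drop_mask` are made up for the illustration.

```python
import pandas as pd

# Sample rows copied from the question (only the columns the filter needs).
df = pd.DataFrame({
    "Id": [104754, 104755, 104864, 104871, 105388, 105389],
    "timestamp": [
        "2012-03-21 16:23:21.323", "2012-03-21 16:23:21.423",
        "2012-03-22 04:27:04.550", "2012-03-22 04:27:11.670",
        "2012-03-23 06:12:24.523", "2012-03-23 06:12:24.623",
    ],
    "Timediff": [28732.920, 0.100, 43423.127, 7.120, 92712.853, 0.100],
})

# Strip the milliseconds so timestamps compare at one-second resolution.
seconds = df["timestamp"].str.split(".").str[0]

# A row is the second half of a pair when its Timediff is small and it
# shares date, hour, minute and second with the row directly above it.
is_second_of_pair = (df["Timediff"] <= 0.1) & (seconds == seconds.shift(1))

# Mark the row directly above each such row, then drop the marked rows.
drop_mask = is_second_of_pair.shift(-1, fill_value=False)
result = df[~drop_mask]

print(result["Id"].tolist())
# [104755, 104864, 104871, 105389]
```

The surviving Ids match the expected result in the question.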

I added some values to your DataFrame so that several records with a timediff of 0.100 share the same hour, minute, and second:

import pandas as pd

df = pd.DataFrame({'data': ['16:23:21.323 19.5', '16:23:21.423 20.0', '04:27:04.550 19.5',
                            '04:27:11.670 19.5', '06:12:24.523 19.0', '06:12:24.623 18.5',
                            '16:23:21.323 19.5', '06:12:24.523 19.0'],
                   'timestamp': ['2012-03-21', '2012-03-21', '2012-03-22', '2012-03-22',
                                 '2012-03-23', '2012-03-23', '2012-03-23', '2012-03-23'],
                   'timediff': [28732.920, 0.100, 43423.127, 7.120, 0.100, 0.100, 92712.853, 0.100]})

# capture everything before the last space (the time part, milliseconds included)
df['h_m_s'] = df['data'].str.extract(r'(.*) ', expand=False)

                data   timediff   timestamp         h_m_s
0  16:23:21.323 19.5  28732.920  2012-03-21  16:23:21.323
1  16:23:21.423 20.0      0.100  2012-03-21  16:23:21.423
2  04:27:04.550 19.5  43423.127  2012-03-22  04:27:04.550
3  04:27:11.670 19.5      7.120  2012-03-22  04:27:11.670
4  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523
5  06:12:24.623 18.5      0.100  2012-03-23  06:12:24.623
6  16:23:21.323 19.5  92712.853  2012-03-23  16:23:21.323
7  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523

df_same = df[(df.timediff==0.100)]
df = df[~(df.timediff==0.100)]
df_same = df_same.drop_duplicates(subset=['h_m_s'],keep='first')
df=pd.concat([df,df_same])
print(df)
                data   timediff   timestamp         h_m_s
0  16:23:21.323 19.5  28732.920  2012-03-21  16:23:21.323
2  04:27:04.550 19.5  43423.127  2012-03-22  04:27:04.550
3  04:27:11.670 19.5      7.120  2012-03-22  04:27:11.670
6  16:23:21.323 19.5  92712.853  2012-03-23  16:23:21.323
1  16:23:21.423 20.0      0.100  2012-03-21  16:23:21.423
4  06:12:24.523 19.0      0.100  2012-03-23  06:12:24.523
5  06:12:24.623 18.5      0.100  2012-03-23  06:12:24.623
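Note that the pattern `(.*) ` keeps the milliseconds, as the `h_m_s` column in the printed frames shows, so `drop_duplicates` here only removes rows whose full time string repeats. If a key at one-second resolution is wanted instead, a narrower pattern is one option. The sketch below is an illustration on a few of the sample strings, not part of the answer above:

```python
import pandas as pd

data = pd.Series(["16:23:21.323 19.5", "16:23:21.423 20.0", "04:27:04.550 19.5"])

# Capture only HH:MM:SS from the start of each string, leaving out the milliseconds.
h_m_s = data.str.extract(r"^(\d{2}:\d{2}:\d{2})", expand=False)

print(h_m_s.tolist())
# ['16:23:21', '16:23:21', '04:27:04']
```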

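The idea is to group by the datetimes floored to whole seconds (dt.floor in the code below), then remove the row before each match, provided the group contains at least one True in mask m1: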

# convert to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])

# m1: rows whose Timediff equals 0.1
m1 = df['Timediff'].eq(0.1)
# m2: rows immediately before an m1 row (m1 shifted up by one)
m2 = m1.shift(-1, fill_value=False)
# m3: rows whose timestamp, floored to the second, is in a group with at least one True in m1
m3 = m1.groupby(df['timestamp'].dt.floor('s')).transform('any')
# keep everything except rows where both m3 and m2 hold
df = df[~(m3 & m2)]
print(df)
            Id               timestamp  data  sig  events1  Start  Peak  \
104514  104755 2012-03-21 16:23:21.423  20.0 -1.0      0.0    0.0   1.0   
104623  104864 2012-03-22 04:27:04.550  19.5  0.0      0.0    0.0   0.0   
104630  104871 2012-03-22 04:27:11.670  19.5 -1.0      0.0    0.0   1.0   
105148  105389 2012-03-23 06:12:24.623  18.5  1.0      1.0    0.0   0.0   

         Timediff  Datadiff  
104514      0.100       0.5  
104623  43423.127      -0.5  
104630      7.120       0.0  
105148      0.100      -0.5  
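One caveat: `eq(0.1)` is an exact floating-point comparison. It works for the values shown here, but if `Timediff` is computed rather than read in as `0.100`, a tolerance-based test may be safer. The following is a minimal sketch, assuming the gaps are recomputed from the timestamps; the tolerance `1e-6` and the name `near_tenth` are arbitrary choices made for the illustration.

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2012-03-21 16:23:21.323", "2012-03-21 16:23:21.423",
    "2012-03-22 04:27:04.550", "2012-03-22 04:27:11.670",
]))

# Gap to the previous row in seconds; the first row has no predecessor (NaN).
timediff = ts.diff().dt.total_seconds()

# True where the gap is within 1e-6 s of 0.1 s; NaN compares as False.
near_tenth = pd.Series(np.isclose(timediff, 0.1, atol=1e-6), index=timediff.index)

print(near_tenth.tolist())
# [False, True, False, False]
```

`near_tenth` could then stand in for `m1` in the code above.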

Edit: to answer the comment, here is m1 with an additional condition:

m1 = df3['Timediff'].eq(0.100) & df3['Start'].ne(1) & df3['Start'].shift(1).ne(1)
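The comment being answered is not reproduced here, so the following only sketches what this mask does when plugged into the same `m2`/`m3` steps. It assumes `df3` has the same layout as the question's frame (only the needed columns are rebuilt below):

```python
import pandas as pd

# Rows from the question, reduced to the columns the masks use.
df3 = pd.DataFrame({
    "Id": [104754, 104755, 104864, 104871, 105388, 105389],
    "timestamp": pd.to_datetime([
        "2012-03-21 16:23:21.323", "2012-03-21 16:23:21.423",
        "2012-03-22 04:27:04.550", "2012-03-22 04:27:11.670",
        "2012-03-23 06:12:24.523", "2012-03-23 06:12:24.623",
    ]),
    "Start": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "Timediff": [28732.920, 0.100, 43423.127, 7.120, 92712.853, 0.100],
})

# A 0.1 s row only counts if neither it nor the row before it has Start == 1.
m1 = df3["Timediff"].eq(0.100) & df3["Start"].ne(1) & df3["Start"].shift(1).ne(1)
m2 = m1.shift(-1, fill_value=False)
m3 = m1.groupby(df3["timestamp"].dt.floor("s")).transform("any")
filtered = df3[~(m3 & m2)]

print(filtered["Id"].tolist())
# [104754, 104755, 104864, 104871, 105389]
```

With this mask the first pair is left alone because row 104754 has Start == 1, and only row 105388 is dropped.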
