Pandas删除重复项并保留最接近参考时间的值

2024-10-02 10:27:47 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个结构如下的pd.DataFrame

$ df.head()
        target             refTime        name   latitude   longitude  
0          5.0 2020-05-31 22:48:00        YMLT -41.529999  147.190002   
1          6.0 2020-05-31 22:51:00        YWGT -36.419998  146.300003   
2          6.0 2020-05-31 22:56:00        YMAY -36.060001  146.929993   
3          5.0 2020-05-31 22:47:00        SUMU -34.830002  -56.000000   
4          3.0 2020-05-31 22:46:00        FACT -33.990002   18.600000   
5          5.0 2020-05-31 23:00:00        SGES -25.450001  -54.849998   
6          5.0 2020-05-31 23:00:00        SGAS -25.250000  -57.520000
7          5.0 2020-05-31 22:59:00        SUMU -34.830002  -56.000000      
8          8.0 2020-05-31 23:00:00        NFFN -17.750000  177.449997   
9          7.0 2020-05-31 23:00:00        SBPS -16.430000  -39.080002   
10         7.0 2020-05-31 22:50:00        NSTU -14.330000 -170.720001  

这是2020年5月31日各台站记录的天气数据。可能存在重复的(站点)名称,例如SUMU。但是,每个副本都有不同的refTime

问题:我如何删除重复的电台记录,同时保留最接近“完整”小时的观测值(在本例中,23:00:00)?对于这个特定的示例,我最终会删除第3行


Tags: nametargetdataframedf记录结构headpd
3条回答

我们可以使用pandas.drop_duplicates()函数删除与另一行重复的所有行。 在您的情况下,调用df.drop_duplicates()

让我们试试assigndt.round

我们的想法是按距离小时的距离对值进行排序,然后对值进行排序,并保留第一个实例

import pandas as pd
import numpy as np
df2 = (
    df.assign(
        hour_diff=(df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")
    )
    .sort_values("hour_diff")
    .drop_duplicates(subset=["name"], keep="first")
    .drop("hour_diff", axis=1)
)

print(df2)

    target             refTime  name   latitude   longitude
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000

距离度量将如下所示:

df.assign(
         hour_diff=(df["refTime"].dt.round("H") - df["refTime"]) / np.timedelta64(1, "m")
        )

    target             refTime  name   latitude   longitude  hour_diff
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002       12.0
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003        9.0
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993        4.0
3      5.0 2020-05-31 22:47:00  SUMU -34.830002  -56.000000       13.0 # we drop this
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000       14.0
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998        0.0
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000        0.0
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000        1.0 # we keep this one
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997        0.0
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002        0.0
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001       10.0

使用^{}表示舍入值,用ref_time减去,用^{}创建绝对值,用^{}表示最小值的索引,因此可以通过^{}选择:

idx = (df["refTime"].dt.round("H").sub(df["refTime"]).abs()
                    .groupby(df['name'], sort=False).idxmin())
df = df.loc[idx]
print (df)
    target             refTime  name   latitude   longitude
0      5.0 2020-05-31 22:48:00  YMLT -41.529999  147.190002
1      6.0 2020-05-31 22:51:00  YWGT -36.419998  146.300003
2      6.0 2020-05-31 22:56:00  YMAY -36.060001  146.929993
7      5.0 2020-05-31 22:59:00  SUMU -34.830002  -56.000000
4      3.0 2020-05-31 22:46:00  FACT -33.990002   18.600000
5      5.0 2020-05-31 23:00:00  SGES -25.450001  -54.849998
6      5.0 2020-05-31 23:00:00  SGAS -25.250000  -57.520000
8      8.0 2020-05-31 23:00:00  NFFN -17.750000  177.449997
9      7.0 2020-05-31 23:00:00  SBPS -16.430000  -39.080002
10     7.0 2020-05-31 22:50:00  NSTU -14.330000 -170.720001

详细信息

print (df["refTime"].dt.round("H").sub(df["refTime"]))
0    00:12:00
1    00:09:00
2    00:04:00
3    00:13:00
4    00:14:00
5    00:00:00
6    00:00:00
7    00:01:00
8    00:00:00
9    00:00:00
10   00:10:00
Name: refTime, dtype: timedelta64[ns]

print (idx)
name
YMLT     0
YWGT     1
YMAY     2
SUMU     7
FACT     4
SGES     5
SGAS     6
NFFN     8
SBPS     9
NSTU    10
Name: refTime, dtype: int64

相关问题 更多 >

    热门问题