Pandas: resampling and normalizing irregular time series data

Published 2024-10-16 20:42:15


I have irregularly spaced time series data. For each record I have the total energy used and the duration over which it was used.

Start Date  Start Time      Duration (Hours)    Usage(kWh)
1/3/2016    12:28:00 PM     2.233333333         6.23
1/3/2016    4:55:00 PM      1.9                 11.45
1/4/2016    6:47:00 PM      7.216666667         11.93
1/4/2016    7:00:00 AM      3.45                9.45
1/4/2016    7:26:00 AM      1.6                 7.33
1/4/2016    7:32:00 AM      1.6                 4.54

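I want to calculate the sum of all the load curves over 15-minute periods. I can round if necessary (e.g., to the nearest 1 minute). I can't use resample directly, because it would average the usage out to the next timestamp: for the first entry, 1/3 12:28 PM, it would take the 6.23 kWh and spread it evenly until 4:55 PM, which is inaccurate. The 6.23 kWh should be spread over 12:28 PM + 2.23 hours, i.e. until about 2:42 PM.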
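To make the arithmetic in the question concrete, here is a minimal sketch for the first row only. The variable names are illustrative and not part of the original post; it assumes the energy is drawn at a constant rate over the stated duration.

```python
import pandas as pd

# First row of the table: 6.23 kWh used over 2.233333333 hours from 12:28 PM
start = pd.Timestamp('2016-01-03 12:28:00')
hours = 2.233333333
usage_kwh = 6.23

# End of the interval: start plus the duration
end = start + pd.to_timedelta(hours, unit='h')

# Energy per minute, assuming a constant draw over the interval
kwh_per_minute = usage_kwh / (hours * 60)

print(end.round('min'))          # 2016-01-03 14:42:00
print(round(kwh_per_minute, 4))  # 0.0465
```

Spreading 6.23 kWh over these roughly 134 minutes, rather than over the gap to the next record, is what the answers below implement.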


2 Answers

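This is a straightforward implementation. It sets up a Series, result, whose index has minute frequency, then loops through the rows of df (using df.itertuples) and adds each row's power to the minutes in that row's interval: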

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Duration (Hours)': [2.233333333, 1.8999999999999999, 7.2166666670000001, 3.4500000000000002, 1.6000000000000001, 1.6000000000000001], 'Start Date': ['1/3/2016', '1/3/2016', '1/4/2016', '1/4/2016', '1/4/2016', '1/4/2016'], 'Start Time': ['12:28:00 PM', '4:55:00 PM', '6:47:00 PM', '7:00:00 AM', '7:26:00 AM', '7:32:00 AM'], 'Usage(kWh)': [6.2300000000000004, 11.449999999999999, 11.93, 9.4499999999999993, 7.3300000000000001, 4.54]} ) 

# Parse durations and start times, then derive the end of each interval
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='h')
df['start_date'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
df['end_date'] = df['start_date'] + df['duration']
# Energy used per minute (kWh per minute), assuming a constant draw over the interval
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Start Date', 'Start Time', 'Duration (Hours)'], axis=1)

# Float zeros, so that adding fractional power values does not require an int-to-float upcast
result = pd.Series(0.0,
    index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='min'))

power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15-minute windows is computed using the `resample/sum` method:
usage = result.resample('15min').sum()
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()

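One detail worth checking in the loop above: label-based slices such as result.loc[start:end] include both endpoints, so an interval that ends exactly on a minute is credited one extra minute. The following is a minimal sketch of a variant, not the answerer's code, that rounds each end time to the nearest minute (which the question allows) and uses half-open intervals so that the total energy is preserved. The column names start, hours, kwh and end are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Two rows from the question, with hypothetical column names
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-01-03 12:28', '2016-01-03 16:55']),
    'hours': [2.233333333, 1.9],
    'kwh': [6.23, 11.45],
})
# Round each end time to the nearest minute
df['end'] = (df['start'] + pd.to_timedelta(df['hours'], unit='h')).dt.round('min')

minutes = pd.date_range(df['start'].min(), df['end'].max(), freq='min')
result = pd.Series(0.0, index=minutes)

for start, end, kwh in zip(df['start'], df['end'], df['kwh']):
    # Half-open interval [start, end): the end minute itself is excluded
    inside = (minutes >= start) & (minutes < end)
    # Split the row's energy evenly over the minutes it covers
    result[inside] += kwh / inside.sum()

usage = result.resample('15min').sum()
print(np.isclose(usage.sum(), df['kwh'].sum()))
```

With this convention the 15-minute totals add up to the total of the kwh column, which gives a quick sanity check on the resampled output.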

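A note on performance: looping through the rows of df is not very fast, especially when len(df) is large. For better performance, you may need a more clever method that handles all the rows "at once" in a vectorized manner: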

[code block not preserved: definitions of using_loop and using_cumsum]
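The answer's own listing is not available here, so the following is only a sketch of how a cumsum-based version could work, not the original code. The idea: record +rate at the first minute of each interval and -rate at the first minute after it, then take a running sum over the minute grid. The helper minute_index and the columns hours, kwh and rate are assumptions made for this example; using_loop is included for comparison.

```python
import numpy as np
import pandas as pd

def minute_index(df):
    # Whole-minute grid spanning every interval (hypothetical helper)
    return pd.date_range(df['start_date'].min().floor('min'),
                         df['end_date'].max(), freq='min')

def using_loop(df):
    # Row-by-row version: add each row's rate to the minutes it covers
    result = pd.Series(0.0, index=minute_index(df))
    for start, end, rate in zip(df['start_date'], df['end_date'], df['rate']):
        result.loc[start:end] += rate
    return result

def using_cumsum(df):
    rate = df['rate'].to_numpy()
    # First grid minute inside each interval, and first grid minute past it
    first = pd.DatetimeIndex(df['start_date'].dt.ceil('min'))
    after = pd.DatetimeIndex(df['end_date'].dt.floor('min')) + pd.Timedelta(minutes=1)
    # +rate where an interval begins, -rate just after it ends
    deltas = pd.concat([pd.Series(rate, index=first),
                        pd.Series(-rate, index=after)])
    # Merge events that share a timestamp, then accumulate over the grid
    deltas = deltas.groupby(level=0).sum()
    return deltas.reindex(minute_index(df), fill_value=0.0).cumsum()

# Small demo on three rows from the question
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-01-03 12:28', '2016-01-03 16:55',
                                  '2016-01-04 07:00']),
    'hours': [2.233333333, 1.9, 3.45],
    'kwh': [6.23, 11.45, 9.45],
})
df['end_date'] = df['start_date'] + pd.to_timedelta(df['hours'], unit='h')
df['rate'] = df['kwh'] / (df['hours'] * 60)

loop_result = using_loop(df)
cumsum_result = using_cumsum(df)
print(np.allclose(loop_result.to_numpy(), cumsum_result.to_numpy()))
```

The cumsum version touches each row only twice (once for the start event and once for the end event) instead of writing to a slice per row, which is where a speedup for large len(df) would come from.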

With len(df) equal to 1000, using_cumsum is more than 10x faster than using_loop:

In [117]: %timeit using_loop(df)
1 loop, best of 3: 545 ms per loop

In [118]: %timeit using_cumsum(df)
10 loops, best of 3: 52.7 ms per loop

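The solution I used, shown below, is the itertuples method. Note that using the numpy.sum function did not work for me. Instead I used the pandas resample keyword "how" and set it to sum (on pandas versions that no longer accept "how", the equivalent is to call .sum() on the resampler).

I also renamed the columns in my file to make the import easier.

I am not constrained by time or resources, so I went with the itertuples method because it was easy to implement.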

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#load data
df = pd.read_excel(r'C:\input_file.xlsx', sheet_name='sheet1')

#convert columns
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='h')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Duration (Hours)'], axis=1)

#create result df with timestamps
result = pd.Series(0.0, index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='min'))

#iterate through to calculate total energy at each minute
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15-minute windows is computed with `resample(...).sum()`,
# the equivalent of `how='sum'` on pandas versions that no longer accept `how`
usage = result.resample('15min').sum()

#plot
plt.plot(usage)
plt.show()

#write to file
usage.to_csv(r'C:\output_folder\output_file.csv')

[Figure: solution using the itertuples method]
