Pandas: resampling and normalizing irregular time series data

2 answers

User

#1 · edited 2024-10-16 20:42:15

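Here is a straightforward implementation. It builds a Series, result, whose index has minute frequency, then iterates over the rows of df (using df.itertuples) and adds each row's power to the corresponding interval: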

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Duration (Hours)': [2.233333333, 1.8999999999999999, 7.2166666670000001, 3.4500000000000002, 1.6000000000000001, 1.6000000000000001], 'Start Date': ['1/3/2016', '1/3/2016', '1/4/2016', '1/4/2016', '1/4/2016', '1/4/2016'], 'Start Time': ['12:28:00 PM', '4:55:00 PM', '6:47:00 PM', '7:00:00 AM', '7:26:00 AM', '7:32:00 AM'], 'Usage(kWh)': [6.2300000000000004, 11.449999999999999, 11.93, 9.4499999999999993, 7.3300000000000001, 4.54]} ) 

df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='h')
df['start_date'] = pd.to_datetime(df['Start Date'] + ' ' + df['Start Time'])
df['end_date'] = df['start_date'] + df['duration']
# Energy used per minute of the session; despite the column name, the unit is kWh per minute
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Start Date', 'Start Time', 'Duration (Hours)'], axis=1)

# Start from float zeros so that float power values can be added in place
result = pd.Series(0.0,
    index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='min'))

power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15 minute windows is computed using the `resample/sum` method:
usage = result.resample('15min').sum()
usage.plot(kind='line', label='usage')
plt.legend(loc='best')
plt.show()
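
One detail of the loop is worth checking: label-based slicing with .loc includes both endpoints, so a row whose start and end both fall exactly on the minute grid touches one more grid point than its duration in minutes. A small sketch on made-up timestamps (not the question's data):

```python
import pandas as pd

# Toy minute grid (made-up timestamps, for illustration only)
index = pd.date_range('2016-01-03 12:00', '2016-01-03 12:10', freq='min')
series = pd.Series(0.0, index=index)

# A 5-minute interval whose endpoints both lie on the grid
start = pd.Timestamp('2016-01-03 12:02')
end = pd.Timestamp('2016-01-03 12:07')
series.loc[start:end] += 1.0

touched = int((series > 0).sum())
print(touched)  # 6 grid points: 12:02 through 12:07 inclusive
```

If exact energy totals matter, one option is to end the slice one minute earlier (end - pd.Timedelta(minutes=1)); whether that is appropriate depends on how the durations were recorded.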

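A note on performance: looping over the rows of df is not great, especially when len(df) is large. For better performance, you may need a more clever method that handles all rows "at once" in a vectorized way: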

(The code block that defined using_loop and using_cumsum is missing from this copy of the answer.)
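
The vectorized code itself did not survive in this copy of the answer. As a rough sketch of the idea, and not the answerer's actual code, one way to process all rows at once is a difference array: add each row's power at its first minute, subtract it just after its last minute, then take a cumulative sum. The column name power, the helper minute_grid, and the bodies of both functions below are assumptions made for illustration; only the names using_loop and using_cumsum come from the post.

```python
import numpy as np
import pandas as pd

ONE_MINUTE = pd.Timedelta(minutes=1)

def minute_grid(df):
    """Minute-frequency index spanning all rows (hypothetical helper)."""
    return pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='min')

def using_loop(df):
    """Reference version: add each row's power to its interval, one row at a time."""
    result = pd.Series(0.0, index=minute_grid(df))
    for row in df.itertuples(index=False):
        result.loc[row.start_date:row.end_date] += row.power
    return result.resample('15min').sum()

def using_cumsum(df):
    """Vectorized version: difference array followed by a cumulative sum."""
    index = minute_grid(df)
    t0 = index[0]
    # First grid position at or after each start, and first position after each
    # end (the .loc slice in the loop version includes the end point).
    first = np.ceil((df['start_date'] - t0) / ONE_MINUTE).to_numpy().astype(int)
    after_last = np.floor((df['end_date'] - t0) / ONE_MINUTE).to_numpy().astype(int) + 1
    power = df['power'].to_numpy(dtype=float)
    delta = np.zeros(len(index) + 1)
    np.add.at(delta, first, power)        # power switches on
    np.add.at(delta, after_last, -power)  # power switches off
    per_minute = pd.Series(np.cumsum(delta)[:-1], index=index)
    return per_minute.resample('15min').sum()

# Made-up example data with two overlapping sessions
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-01-03 12:00:00', '2016-01-03 12:10:00']),
    'end_date': pd.to_datetime(['2016-01-03 12:29:30', '2016-01-03 12:40:00']),
    'power': [1.0, 2.0],
})
print(using_cumsum(df))
```

Because the cumulative sum adds and subtracts floats, its results can differ from the loop's by tiny rounding errors, so compare the two with np.allclose rather than exact equality.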

When len(df) is 1000, using_cumsum is more than 10 times faster than using_loop:

In [117]: %timeit using_loop(df)
1 loop, best of 3: 545 ms per loop

In [118]: %timeit using_cumsum(df)
10 loops, best of 3: 52.7 ms per loop

User

#2 · edited 2024-10-16 20:42:15

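The solution I used, shown below, is the itertuples method. Note that using the numpy.sum function did not work for me. Instead, I used the pandas resample keyword "how" and set it to sum.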
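
The "how" keyword mentioned above belongs to older pandas releases and was later removed from resample; on a current version the same aggregation is written by calling .sum() on the resampler. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy per-minute series: 30 minutes, each with value 1.0 (made-up data)
index = pd.date_range('2016-01-03 12:00', periods=30, freq='min')
per_minute = pd.Series(np.ones(30), index=index)

# Current spelling of the older resample(..., how='sum')
usage = per_minute.resample('15min').sum()
print(usage)
```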

I also renamed the columns in the file to make the import easier.

I am not constrained by time or resources, so I went with the itertuples method because it is easy to implement.

The code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#load data
df = pd.read_excel(r'C:\input_file.xlsx', sheet_name='sheet1')

#convert columns
df['duration'] = pd.to_timedelta(df['Duration (Hours)'], unit='h')
df['end_date'] = df['start_date'] + df['duration']
df['power (kW/min)'] = df['Usage(kWh)']/(df['Duration (Hours)']*60)
df = df.drop(['Duration (Hours)'], axis=1)

#create result df with timestamps
result = pd.Series(0.0, index=pd.date_range(df['start_date'].min(), df['end_date'].max(), freq='min'))

#iterate through to calculate total energy at each minute
power_idx = df.columns.get_loc('power (kW/min)')+1
for row in df.itertuples():
    result.loc[row.start_date:row.end_date] += row[power_idx]

# The sum of the usage over 15 minute windows is computed using the `resample/sum` method
# (older pandas spelled this result.resample('15T', how='sum'); the `how` keyword was later removed)
usage = result.resample('15min').sum()

#plot
plt.plot(usage)
plt.show()

#write to file
usage.to_csv(r'C:\output_folder\output_file.csv')
