How can I resample a df with a datetime index into n equally sized periods?


1 answer

Anonymous, 1 day ago

<p>Here is one way to make sure the sub-periods are equal in size: apply <code>np.linspace()</code> to the elapsed time, expressed in units of <code>pd.Timedelta</code>, and then use <code>pd.cut</code> to assign each observation to a bin.</p>
<pre><code>import pandas as pd
import numpy as np

# generate artificial data
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'],
                  index=pd.date_range('2015-01-01 00:00:00', periods=100, freq='8h'))

Out[87]:
                          A       B
2015-01-01 00:00:00  1.7641  0.4002
2015-01-01 08:00:00  0.9787  2.2409
2015-01-01 16:00:00  1.8676 -0.9773
2015-01-02 00:00:00  0.9501 -0.1514
2015-01-02 08:00:00 -0.1032  0.4106
2015-01-02 16:00:00  0.1440  1.4543
2015-01-03 00:00:00  0.7610  0.1217
2015-01-03 08:00:00  0.4439  0.3337
2015-01-03 16:00:00  1.4941 -0.2052
2015-01-04 00:00:00  0.3131 -0.8541
2015-01-04 08:00:00 -2.5530  0.6536
2015-01-04 16:00:00  0.8644 -0.7422
2015-01-05 00:00:00  2.2698 -1.4544
2015-01-05 08:00:00  0.0458 -0.1872
2015-01-05 16:00:00  1.5328  1.4694
...                     ...     ...
2015-01-29 08:00:00  0.9209  0.3187
2015-01-29 16:00:00  0.8568 -0.6510
2015-01-30 00:00:00 -1.0342  0.6816
2015-01-30 08:00:00 -0.8034 -0.6895
2015-01-30 16:00:00 -0.4555  0.0175
2015-01-31 00:00:00 -0.3540 -1.3750
2015-01-31 08:00:00 -0.6436 -2.2234
2015-01-31 16:00:00  0.6252 -1.6021
2015-02-01 00:00:00 -1.1044  0.0522
2015-02-01 08:00:00 -0.7396  1.5430
2015-02-01 16:00:00 -1.2929  0.2671
2015-02-02 00:00:00 -0.0393 -1.1681
2015-02-02 08:00:00  0.5233 -0.1715
2015-02-02 16:00:00  0.7718  0.8235
2015-02-03 00:00:00  2.1632  1.3365

[100 rows x 2 columns]

# cutoff points: 10 equal-size groups require 11 points,
# measured in hours elapsed since the first timestamp
time_delta_in_hours = (df.index - df.index[0]) / pd.Timedelta('1h')
n = 10
ts_cutoff = np.linspace(0, time_delta_in_hours[-1], n+1)

# labels: convert the cutoff points back to timestamps
time_index = df.index[0] + np.array([pd.Timedelta(str(time_delta)+'h') for time_delta in ts_cutoff])

# create a categorical reference variable, labelled by the start of each period;
# include_lowest=True keeps the very first observation in the first bin
df['start_time_index'] = pd.cut(time_delta_in_hours, bins=ts_cutoff,
                                labels=time_index[:-1], include_lowest=True)

# for clarity, also label each period by its end point
df['end_time_index'] = pd.cut(time_delta_in_hours, bins=ts_cutoff,
                              labels=time_index[1:], include_lowest=True)

Out[89]:
                          A       B     start_time_index       end_time_index
2015-01-01 00:00:00  1.7641  0.4002  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-01 08:00:00  0.9787  2.2409  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-01 16:00:00  1.8676 -0.9773  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-02 00:00:00  0.9501 -0.1514  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-02 08:00:00 -0.1032  0.4106  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-02 16:00:00  0.1440  1.4543  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-03 00:00:00  0.7610  0.1217  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-03 08:00:00  0.4439  0.3337  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-03 16:00:00  1.4941 -0.2052  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-04 00:00:00  0.3131 -0.8541  2015-01-01 00:00:00  2015-01-04 07:12:00
2015-01-04 08:00:00 -2.5530  0.6536  2015-01-04 07:12:00  2015-01-07 14:24:00
2015-01-04 16:00:00  0.8644 -0.7422  2015-01-04 07:12:00  2015-01-07 14:24:00
2015-01-05 00:00:00  2.2698 -1.4544  2015-01-04 07:12:00  2015-01-07 14:24:00
2015-01-05 08:00:00  0.0458 -0.1872  2015-01-04 07:12:00  2015-01-07 14:24:00
2015-01-05 16:00:00  1.5328  1.4694  2015-01-04 07:12:00  2015-01-07 14:24:00
...                     ...     ...                  ...                  ...
2015-01-29 08:00:00  0.9209  0.3187  2015-01-27 09:36:00  2015-01-30 16:48:00
2015-01-29 16:00:00  0.8568 -0.6510  2015-01-27 09:36:00  2015-01-30 16:48:00
2015-01-30 00:00:00 -1.0342  0.6816  2015-01-27 09:36:00  2015-01-30 16:48:00
2015-01-30 08:00:00 -0.8034 -0.6895  2015-01-27 09:36:00  2015-01-30 16:48:00
2015-01-30 16:00:00 -0.4555  0.0175  2015-01-27 09:36:00  2015-01-30 16:48:00
2015-01-31 00:00:00 -0.3540 -1.3750  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-01-31 08:00:00 -0.6436 -2.2234  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-01-31 16:00:00  0.6252 -1.6021  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-01 00:00:00 -1.1044  0.0522  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-01 08:00:00 -0.7396  1.5430  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-01 16:00:00 -1.2929  0.2671  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-02 00:00:00 -0.0393 -1.1681  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-02 08:00:00  0.5233 -0.1715  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-02 16:00:00  0.7718  0.8235  2015-01-30 16:48:00  2015-02-03 00:00:00
2015-02-03 00:00:00  2.1632  1.3365  2015-01-30 16:48:00  2015-02-03 00:00:00

[100 rows x 4 columns]

# aggregate the numeric columns within each period
df.groupby('start_time_index', observed=True)[['A', 'B']].sum()

Out[90]:
                          A       B
start_time_index
2015-01-01 00:00:00  8.6133  2.7734
2015-01-04 07:12:00  1.9220 -0.8069
2015-01-07 14:24:00 -8.1334  0.2318
2015-01-10 21:36:00 -2.7572 -4.2862
2015-01-14 04:48:00  1.1957  7.2285
2015-01-17 12:00:00  3.2485  6.6841
2015-01-20 19:12:00 -0.8903  2.2802
2015-01-24 02:24:00 -2.1025  1.3800
2015-01-27 09:36:00 -1.1017  1.3108
2015-01-30 16:48:00 -0.0902 -2.5178
</code></pre>
<p>Another, shorter way is to pass the sampling frequency to <code>resample</code> as a timedelta. The catch is that it yields 11 subsamples rather than 10. I think the reason is that <code>resample</code> uses a left-inclusive/right-exclusive (or left-exclusive/right-inclusive) binning scheme, so the last observation, at '2015-02-03 00:00:00', is treated as a group of its own. If we do the binning ourselves with <code>pd.cut</code>, we can specify <code>include_lowest=True</code>, so that it gives 10 subsamples instead of 11.</p>
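<p>The code for the <code>resample</code> variant is not part of this copy of the answer, so the following is only a minimal sketch of what such a comparison might look like, not the answerer's original code. The names <code>step</code>, <code>edges</code> and <code>codes</code> are made up for illustration, and using <code>labels=False</code> to get plain integer bin numbers is a simplification chosen here.</p>

```python
import numpy as np
import pandas as pd

# Same artificial data as in the answer above.
np.random.seed(0)
index = pd.date_range('2015-01-01', periods=100, freq=pd.Timedelta(hours=8))
df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=index)

n = 10
# Width of one period: the total span divided by n (assumed name: step).
step = (df.index[-1] - df.index[0]) / n

# Variant 1: resample with a timedelta rule. With the default left-closed
# bins, the final timestamp sits exactly on the last edge and opens an
# extra bin of its own.
by_resample = df.resample(step).sum()

# Variant 2: explicit edges with pd.cut. Bins are right-closed, and
# include_lowest=True pulls the first observation into the first bin.
hours = np.asarray((df.index - df.index[0]) / pd.Timedelta('1h'))
edges = np.linspace(0, hours[-1], n + 1)
codes = pd.cut(hours, bins=edges, labels=False, include_lowest=True)
by_cut = df.groupby(codes).sum()

print(len(by_resample), len(by_cut))
```

<p>Both variants sum the same 100 rows, so their column totals agree; they differ only in where the observation on the final edge ends up.</p>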
