Pythonic way to select the latest values from a timeseries DataFrame

I have a time series with multiple values per datetime. Each entry in the datetime index has an associated datetime at which the value was loaded, the "loadtime", as shown below:
<pre><code>import datetime as dt
import numpy as np
import pandas as pd

# time-series index
t = pd.date_range('09/01/2017', '09/02/2017', freq='1H')
t = t.repeat(3)
n = len(t)

# data values
y = np.full((n), 0.0)
y = y.reshape(n//3, 3)
y[:, 1] = 1.0
y[:, 2] = 2.0
y = y.flatten()

# load timestamp
random_range = np.arange(0, 60)
base_date = np.datetime64('2017-10-01 12:00')
loadtimes = [base_date + np.random.choice(random_range) for x in range(n)]

df = pd.DataFrame(index=t, data={'y': y, 'loadtime': loadtimes})

>>> df.head(12)
                               loadtime    y
2017-09-01 00:00:00 2017-10-02 01:59:00  0.0
2017-09-01 00:00:00 2017-10-02 09:23:00  1.0
2017-09-01 00:00:00 2017-10-02 03:35:00  2.0
2017-09-01 01:00:00 2017-10-01 17:26:00  0.0
2017-09-01 01:00:00 2017-10-01 16:44:00  1.0
2017-09-01 01:00:00 2017-10-02 12:50:00  2.0
2017-09-01 02:00:00 2017-10-02 11:30:00  0.0
2017-09-01 02:00:00 2017-10-02 11:17:00  1.0
2017-09-01 02:00:00 2017-10-01 20:23:00  2.0
2017-09-01 03:00:00 2017-10-02 15:27:00  0.0
2017-09-01 03:00:00 2017-10-02 18:08:00  1.0
2017-09-01 03:00:00 2017-10-01 16:06:00  2.0
</code></pre>
So far I have come up with the solution below, which iterates over all unique index values. It could become expensive as the time series gets longer (and as the number of values per timestamp grows), and it feels hacky rather than clean:
<pre><code>new_index = df.index.unique()
df_new = pd.DataFrame(index=new_index, columns=['y'])

# cycle through unique indices to find max loadtime
dfg = df.groupby(df.index)
for i, dfg_i in dfg:
    max_index = dfg_i['loadtime'] == dfg_i['loadtime'].max()
    if i in df_new.index:
        df_new.loc[i, 'y'] = dfg_i.loc[max_index, 'y'].values[0]  # WHY IS THIS A LIST?

>>> df_new.head()
                     y
2017-09-01 00:00:00  1
2017-09-01 01:00:00  2
2017-09-01 02:00:00  0
2017-09-01 03:00:00  1
2017-09-01 04:00:00  1
</code></pre>
How can I get the time series that holds, for each unique index value, the entry with the latest "loadtime"? Is there a more Pythonic way to do this?

1 Answer

Anonymous, 1 day ago

First create a column from the <code>DatetimeIndex</code>, then call <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html" rel="nofollow noreferrer"><code>set_index</code></a> on the <code>y</code> column. Then use <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby</code></a> with <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.idxmax.html" rel="nofollow noreferrer"><code>idxmax</code></a>, which returns the index label (here the <code>y</code> column value) of the maximum <code>loadtime</code> in each group:
<pre><code>print (df.rename_axis('dat')
         .reset_index()
         .set_index('y')
         .groupby('dat')['loadtime']
         .idxmax()
         .to_frame('y'))

                       y
dat
2017-09-01 00:00:00  1.0
2017-09-01 01:00:00  2.0
2017-09-01 02:00:00  0.0
2017-09-01 03:00:00  1.0
</code></pre>
Details:
<pre><code>print (df.rename_axis('dat')
         .reset_index()
         .set_index('y'))

                    dat            loadtime
y
0.0 2017-09-01 00:00:00 2017-10-02 01:59:00
1.0 2017-09-01 00:00:00 2017-10-02 09:23:00
2.0 2017-09-01 00:00:00 2017-10-02 03:35:00
0.0 2017-09-01 01:00:00 2017-10-01 17:26:00
1.0 2017-09-01 01:00:00 2017-10-01 16:44:00
2.0 2017-09-01 01:00:00 2017-10-02 12:50:00
0.0 2017-09-01 02:00:00 2017-10-02 11:30:00
1.0 2017-09-01 02:00:00 2017-10-02 11:17:00
2.0 2017-09-01 02:00:00 2017-10-01 20:23:00
0.0 2017-09-01 03:00:00 2017-10-02 15:27:00
1.0 2017-09-01 03:00:00 2017-10-02 18:08:00
2.0 2017-09-01 03:00:00 2017-10-01 16:06:00
</code></pre>
Timings:
<pre><code>t = pd.date_range('01/01/2017', '12/25/2017', freq='1H')

#len(df)
#[25779 rows x 2 columns]

In [225]: %timeit (df.rename_axis('dat').reset_index().set_index('y').groupby('dat')['loadtime'].idxmax().to_frame('y'))
1 loop, best of 3: 870 ms per loop

In [226]: %timeit df.groupby(level=0).apply(lambda x : x.set_index('y').idxmax()).rename(columns={'loadtime':'y'})
1 loop, best of 3: 4.96 s per loop
</code></pre>
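For comparison, here is a minimal sketch of a different approach that is not part of the answer above and was not timed there: sort by <code>loadtime</code> and keep the last row of each index group. The small hand-written frame reuses the first six rows of the question's sample output so that the result is deterministic; the variable name <code>latest</code> is only illustrative.

```python
import pandas as pd

# Small deterministic frame with the same layout as in the question:
# a repeated DatetimeIndex, a value column 'y' and a 'loadtime' column.
idx = pd.to_datetime(['2017-09-01 00:00'] * 3 + ['2017-09-01 01:00'] * 3)
df = pd.DataFrame(
    {
        'y': [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
        'loadtime': pd.to_datetime([
            '2017-10-02 01:59', '2017-10-02 09:23', '2017-10-02 03:35',
            '2017-10-01 17:26', '2017-10-01 16:44', '2017-10-02 12:50',
        ]),
    },
    index=idx,
)

# Sort rows by loadtime so the most recently loaded row comes last,
# then group by the index (level=0) and take the last 'y' of each group.
latest = (df.sort_values('loadtime')
            .groupby(level=0)['y']
            .last()
            .to_frame('y'))

print(latest)
```

On this sample the result contains 1.0 for 2017-09-01 00:00:00 and 2.0 for 2017-09-01 01:00:00, matching the first two rows of the <code>idxmax</code> output. Unlike the <code>idxmax</code> chain, this variant does not need to move <code>y</code> into the index first. If two rows of the same timestamp share the same maximum <code>loadtime</code>, which one is kept depends on the sort, so treat ties as unspecified.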
