Fast generation of partial DataFrames with pandas

2020/03/05 14:59:12.093,92.7884,93.8238
2020/03/05 14:59:14.571,97.1114,51.3926
2020/03/05 14:59:16.035,56.1351,62.6697
:
2020/03/05 15:01:11.652,90.6966,37.9923
2020/03/05 15:01:11.918,35.8304,1.04157

import pandas as pd

df = pd.read_csv(file_name, header=None, names=['time', 'colA', 'colB'])
df['time'] = pd.to_datetime(df['time'], format=r'%Y/%m/%d %H:%M:%S.%f')
df = df.set_index('time')

extracted_dfs = []
startdatetime = df.index[0]
enddatetime = df.index[len(df)-1]
curdatetime = startdatetime
while curdatetime < enddatetime:
    extracted_df = df[curdatetime:curdatetime + pd.Timedelta(seconds=120)].copy()
    extracted_dfs.append(extracted_df)
    curdatetime = curdatetime + pd.Timedelta(seconds=20)

1 Answer

User

#1 · Posted 2024-07-18 13:13:51

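This runs in under 6 seconds on my 2.67 GHz laptop. I used 2M rows spread over 24 hours and extracted 4320 DataFrames (one 120-second window every 20 seconds), which I think is a large enough scale test.

Most of the savings seem to come from moving the curdatetime + pd.Timedelta() computation out of the loop: all window start and end times are built once, up front, instead of being recomputed on every iteration.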

from time import time

import numpy as np
import pandas as pd

### toy dataframe: 2M rows at random times within one day, sorted by time
start = pd.to_datetime('2020-03-05 14:00')
n = int(2e6)
df = pd.DataFrame(
    {'A': np.random.choice(100, n), 'B': np.random.choice(100, n)},
    index=start + pd.to_timedelta(np.random.rand(n)*86400, unit='seconds')
    ).sort_index()

t0 = time()

### build all start datetimes for windows (one every 20 seconds)
gtimes = np.arange(start=df.index[0], stop=df.index[-1],
    step=pd.Timedelta(20, unit='seconds'))
### slice one 120-second window per start time; .loc includes both end labels
extracted_dfs = [df.loc[gt:lt] for gt, lt in
    zip(gtimes, gtimes + pd.Timedelta(120, unit='seconds'))]


print(f'runtime: {time() - t0}s')
print(*extracted_dfs[:2], sep='\n\n')

Output

runtime: 5.9694719314575195s
                                A   B
2020-03-05 14:00:00.029956126  38  47
2020-03-05 14:00:00.043794997  19  93
2020-03-05 14:00:00.274295160  24  26
2020-03-05 14:00:00.345806566   7  96
2020-03-05 14:00:00.358988998  83  18
...                            ..  ..
2020-03-05 14:01:59.811072868  45  75
2020-03-05 14:01:59.895038311  36  26
2020-03-05 14:01:59.936082342  78   6
2020-03-05 14:01:59.974735739  17  25
2020-03-05 14:01:59.985301083   1  34

[2802 rows x 2 columns]

                                A   B
2020-03-05 14:00:20.037424719  95  49
2020-03-05 14:00:20.071532168  70  37
2020-03-05 14:00:20.086438199  46  45
2020-03-05 14:00:20.197759064  60  61
2020-03-05 14:00:20.261713915  31  20
...                            ..  ..
2020-03-05 14:02:19.633312110  30  34
2020-03-05 14:02:19.646400725  50   2
2020-03-05 14:02:19.804335407  40  75
2020-03-05 14:02:19.841056690  18  75
2020-03-05 14:02:19.857622011  90  46

[2768 rows x 2 columns]
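
If the label lookups inside `.loc` turn out to matter as well, one possible variant is to locate all window edges in a single pass with `searchsorted` and then slice by position. This is only a sketch and was not part of the timing above: the helper name `sliding_windows` and its `step` / `width` parameters are made up for illustration, and it assumes the index is a sorted `DatetimeIndex`.

```python
import pandas as pd


def sliding_windows(df, step="20s", width="120s"):
    """Hypothetical helper: list of sub-frames, one per window start.

    Assumes df.index is a DatetimeIndex sorted in ascending order.
    """
    step, width = pd.Timedelta(step), pd.Timedelta(width)
    # Window starts: the first timestamp, then every `step`,
    # stopping before the last timestamp (same condition as the original loop).
    starts = pd.date_range(df.index[0], df.index[-1], freq=step)
    starts = starts[starts < df.index[-1]]
    # Binary-search every window edge at once on the sorted index.
    lo = df.index.searchsorted(starts, side="left")
    hi = df.index.searchsorted(starts + width, side="right")
    # side="right" keeps the end label inclusive, matching df.loc[start:end].
    return [df.iloc[a:b] for a, b in zip(lo, hi)]


# Usage with the toy dataframe built earlier:
# extracted_dfs = sliding_windows(df)
```

The slices are taken without an explicit copy, as in the list comprehension above; append `.copy()` to each slice if the windows will be modified independently, as the question's code does.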
