How to optimize a df.apply function

2 answers

User

#1 · edited 2024-10-03 06:30:08

Case 1: bins of unequal size

In this case, the best approach I can think of is pd.cut:

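# Series that maps each bin's start time to its value
mapper = pd.Series(df_b['Value'])
mapper.index = df_b['StartDateTime']

# Bin edges: every start time plus the final end time
# (assumes df_b has the default integer index)
cutoffs = df_b['StartDateTime'].copy()
cutoffs[cutoffs.index.max() + 1] = df_b['EndDateTime'].max()

# Assign each DateTime to a bin, then look up the value by the bin's left edge
bins = pd.cut(df_a['DateTime'], bins=cutoffs)
df_a['Value'] = mapper.loc[pd.IntervalIndex(bins).left].values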

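This builds a Series that maps each start time to its value. It then builds a second Series of cutoff points, the bin edges that the times in dataframe A will be sorted into (note that the final end time has to be appended manually). pd.cut then assigns each time to a bin, and the left edge of each bin is used to look up the value in the mapping Series with loc.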
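To make the steps above concrete, here is a minimal sketch on hypothetical toy data with three bins of different widths (the timestamps and values are made up for illustration). One thing to keep in mind: pd.cut uses right-closed intervals by default, so a DateTime exactly equal to the first StartDateTime would not land in any bin.

```python
import pandas as pd

# Hypothetical toy data: three bins of unequal width
df_b = pd.DataFrame({
    'StartDateTime': pd.to_datetime(['2020-01-01 06:00', '2020-01-01 06:05', '2020-01-01 06:20']),
    'EndDateTime': pd.to_datetime(['2020-01-01 06:05', '2020-01-01 06:20', '2020-01-01 06:30']),
    'Value': [10.0, 20.0, 30.0],
})
df_a = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01 06:01', '2020-01-01 06:12', '2020-01-01 06:29']),
})

# Map each bin's start time to its value
mapper = pd.Series(df_b['Value'].values, index=df_b['StartDateTime'])

# Bin edges: all start times plus the last end time
cutoffs = pd.DatetimeIndex(df_b['StartDateTime'].tolist() + [df_b['EndDateTime'].max()])

# Bin each DateTime, then look up the value via the bin's left edge
bins = pd.cut(df_a['DateTime'], bins=cutoffs)
df_a['Value'] = mapper.loc[pd.IntervalIndex(bins).left].values

print(df_a)
```

Each row of df_a ends up with the Value of the bin whose start time is the nearest edge below it.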

Case 2: bins of equal size

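It looks like the OP's bins are 5-minute blocks. If that is right, you can use pd.Series.dt.floor() to quickly convert the times in dataframe A into times that can index dataframe B: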

mapper = pd.Series(df_b['Value'])
mapper.index = df_b['StartDateTime']
# '5min' is the same frequency as the older '5T' alias, which recent pandas versions deprecate
df_a['Value'] = mapper.loc[df_a['DateTime'].dt.floor('5min')].values
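As a small illustration of the floor lookup, here is a sketch on hypothetical data with three 5-minute bins (the values are made up). Flooring each timestamp to 5 minutes yields exactly the StartDateTime of the bin it belongs to, which is why a plain loc lookup is enough.

```python
import pandas as pd

# Hypothetical toy data: bins start every 5 minutes
df_b = pd.DataFrame({
    'StartDateTime': pd.date_range('2020-01-01 06:00', freq='5min', periods=3),
    'Value': [0.1, 0.2, 0.3],
})
df_a = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01 06:01:30',
                                '2020-01-01 06:07:00',
                                '2020-01-01 06:14:59']),
})

mapper = pd.Series(df_b['Value'].values, index=df_b['StartDateTime'])

# 06:01:30 -> 06:00, 06:07:00 -> 06:05, 06:14:59 -> 06:10
floored = df_a['DateTime'].dt.floor('5min')
df_a['Value'] = mapper.loc[floored].values

print(df_a)
```

If a floored time is not present in df_b['StartDateTime'], the loc lookup raises a KeyError, so this only works when every time in dataframe A falls inside some bin.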

Timing:

Here is the sample data I used:

import numpy as np
import pandas as pd

size = 100 # tweak this to see each option at scale

dr1 = pd.date_range('01-01-2020 06:00:00', freq='5min', periods=size)
dr2 = pd.date_range('01-01-2020 06:05:00', freq='5min', periods=size)
drA = pd.to_datetime({'year':dr1.year, 'month':dr1.month,
                      'day':dr1.day, 'hour':dr1.hour,
                      'minute':np.random.randint(1,60,len(dr1)),
                      'second':np.random.randint(1,60,len(dr1))}).sort_values()
drA = drA[drA < dr2.max()]

df_a = pd.DataFrame({'DateTime':drA, 'A':range(len(drA))})
df_b = pd.DataFrame({'StartDateTime':dr1, 'EndDateTime':dr2, 'Value':np.random.rand(len(dr2))})

Results using %%timeit with size=100:

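apply: 61 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
pd.cut: 8.98 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
dt.floor: 865 µs ± 17.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Adding @Rik Kraan's answer, np.where*: 1.85 ms ± 7.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)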

*That answer is much better than my pd.cut, but when I increased size to 1000000 I also got a MemoryError: Unable to allocate 931. GiB for an array with shape (999999, 1000000) and data type bool
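The figure in that error message follows directly from the shape: the broadcast comparison builds one boolean per (row of A, row of B) pair, and NumPy stores each bool in one byte. A quick check of the arithmetic, using the shape from the error message:

```python
import numpy as np

# Shape reported in the MemoryError above
n_a, n_b = 999_999, 1_000_000

# One byte per element for a NumPy bool array
bytes_needed = n_a * n_b * np.dtype(bool).itemsize
gib_needed = bytes_needed / 2**30

print(f"{gib_needed:.1f} GiB")
```

The memory cost grows with the product of the two lengths, so the np.where approach is fast at small sizes but stops being practical once both frames are large.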

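So floor is clearly much faster than the original approach. But it gives wrong results if your bins are not evenly spaced. You can check this with df_b['StartDateTime'].dt.minute.unique() or df_b['StartDateTime'].dt.time.unique(). If you can find suitable floor values, you could even apply several of them iteratively.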
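A rough sketch of such a check on hypothetical data follows. The helper is_evenly_spaced is my own illustration, not something from pandas: it compares the gaps between consecutive start times, which is one possible complement to eyeballing the unique minute values.

```python
import pandas as pd

# Hypothetical start times: one evenly spaced set, one uneven set
even = pd.Series(pd.to_datetime(['2020-01-01 06:00', '2020-01-01 06:05', '2020-01-01 06:10']))
uneven = pd.Series(pd.to_datetime(['2020-01-01 06:00', '2020-01-01 06:05', '2020-01-01 06:20']))

def is_evenly_spaced(starts):
    """Return True if all gaps between consecutive start times are equal."""
    widths = starts.sort_values().diff().dropna()
    return widths.nunique() == 1

# The check suggested above: inspect which minute values occur
minutes = even.dt.minute.unique()

print(list(minutes), is_evenly_spaced(even), is_evenly_spaced(uneven))
```

Equal gaps alone do not prove that the start times line up with the floor frequency, so looking at the unique minute or time values as well is still worthwhile.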

But the pd.cut version is still a significant improvement; maybe there are further optimizations I'm not seeing.
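One candidate that may be worth timing, and which is not part of the answer above, is pd.merge_asof. With direction='backward' it matches each DateTime to the latest StartDateTime at or before it, without building a full pairwise mask. The sketch below uses hypothetical toy data and assumes the bins do not overlap; both frames must be sorted on their key columns.

```python
import pandas as pd

# Hypothetical toy data: unequal bins, and one DateTime past the last bin
df_b = pd.DataFrame({
    'StartDateTime': pd.to_datetime(['2020-01-01 06:00', '2020-01-01 06:05', '2020-01-01 06:20']),
    'EndDateTime': pd.to_datetime(['2020-01-01 06:05', '2020-01-01 06:20', '2020-01-01 06:30']),
    'Value': [10.0, 20.0, 30.0],
})
df_a = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01 06:01', '2020-01-01 06:12', '2020-01-01 06:45']),
    'A': [0, 1, 2],
})

# Match each DateTime to the most recent StartDateTime at or before it
merged = pd.merge_asof(
    df_a.sort_values('DateTime'),
    df_b.sort_values('StartDateTime'),
    left_on='DateTime',
    right_on='StartDateTime',
    direction='backward',
)

# A backward match only checks the start; discard rows that fall past the bin's end
merged.loc[merged['DateTime'] > merged['EndDateTime'], 'Value'] = float('nan')

print(merged[['DateTime', 'A', 'Value']])
```

I have not timed this against the options above, so treat it as something to benchmark on your own data rather than a guaranteed win.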

User

#2 · edited 2024-10-03 06:30:08

Let's first create two arrays holding the row indices of dataframes A and B where the condition is met (A['DateTime'] lies between B['StartDateTime'] and B['EndDateTime']):

i, j = np.where(
    (A['DateTime'].values[:, None] >= B['StartDateTime'].values) &
    (A['DateTime'].values[:, None] <= B['EndDateTime'].values)
)

Then select the rows of dataframes A and B that correspond to these indices and build a new dataframe:

pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns)
)
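For readers who want to see the two steps run end to end, here is a minimal sketch on hypothetical two-row frames (the names A and B follow the snippets above; the data is made up). Note that both comparisons are inclusive, so a DateTime that sits exactly on a boundary shared by two bins would match both of them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames
A = pd.DataFrame({
    'DateTime': pd.to_datetime(['2020-01-01 06:01', '2020-01-01 06:07']),
    'A': [0, 1],
})
B = pd.DataFrame({
    'StartDateTime': pd.to_datetime(['2020-01-01 06:00', '2020-01-01 06:05']),
    'EndDateTime': pd.to_datetime(['2020-01-01 06:05', '2020-01-01 06:10']),
    'Value': [0.5, 0.7],
})

# Broadcast a (len(A), 1) column against (len(B),) rows -> (len(A), len(B)) mask
t = A['DateTime'].values[:, None]
mask = (t >= B['StartDateTime'].values) & (t <= B['EndDateTime'].values)

# Row indices into A (i) and B (j) for every matching pair
i, j = np.where(mask)

result = pd.DataFrame(
    np.column_stack([A.values[i], B.values[j]]),
    columns=A.columns.append(B.columns),
)

print(result)
```

Because A.values and B.values mix datetimes with numbers, the stacked array has object dtype, so the columns of the result may need converting back to proper dtypes afterwards.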
