I am trying to understand why a loop slows down as the number of iterations grows. This code is just a mock of some real code that copies data from an API. I have to download the data in batches because if I download everything at once I run out of memory. However, my loop implementation of the batching is far from ideal. I suspect that using pandas adds overhead, but other than that, what could be causing the problem?
import timeit
import pandas as pd
from tqdm import tqdm
def some_generator():
    for i in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }

def main():
    batch_size = 10_000
    generator = some_generator()
    output = pd.DataFrame()
    batch_round = 1
    while True:
        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):
            try:
                row = next(generator)
                row.pop('colA')
                output = pd.concat([output, pd.DataFrame(row, index=[0])], ignore_index=True)
            except StopIteration:
                break
        if output.shape[0] != batch_size * batch_round:
            break
        else:
            batch_round += 1
    print(output)
This code simulates building a 1M-row DataFrame. If I download the data in batches of 10k, this is the performance I get over the first 20 batches:
Batch 1: 100%|██████████| 10000/10000 [00:21<00:00, 460.89it/s]
Batch 2: 100%|██████████| 10000/10000 [00:28<00:00, 349.16it/s]
Batch 3: 100%|██████████| 10000/10000 [00:38<00:00, 263.12it/s]
Batch 4: 100%|██████████| 10000/10000 [00:43<00:00, 228.76it/s]
Batch 5: 100%|██████████| 10000/10000 [00:53<00:00, 187.44it/s]
Batch 6: 100%|██████████| 10000/10000 [01:02<00:00, 159.92it/s]
Batch 7: 100%|██████████| 10000/10000 [01:09<00:00, 144.79it/s]
Batch 8: 100%|██████████| 10000/10000 [01:18<00:00, 127.59it/s]
Batch 9: 100%|██████████| 10000/10000 [01:25<00:00, 116.92it/s]
Batch 10: 100%|██████████| 10000/10000 [01:34<00:00, 105.96it/s]
Batch 11: 100%|██████████| 10000/10000 [01:40<00:00, 99.81it/s]
Batch 12: 100%|██████████| 10000/10000 [01:46<00:00, 93.92it/s]
Batch 13: 100%|██████████| 10000/10000 [01:55<00:00, 86.49it/s]
Batch 14: 100%|██████████| 10000/10000 [02:03<00:00, 80.92it/s]
Batch 15: 100%|██████████| 10000/10000 [02:10<00:00, 76.46it/s]
Batch 16: 100%|██████████| 10000/10000 [02:18<00:00, 71.99it/s]
Batch 17: 100%|██████████| 10000/10000 [02:25<00:00, 68.69it/s]
Batch 18: 100%|██████████| 10000/10000 [02:32<00:00, 65.57it/s]
Batch 19: 100%|██████████| 10000/10000 [02:42<00:00, 61.53it/s]
Batch 20: 100%|██████████| 10000/10000 [02:39<00:00, 62.84it/s]
pd.concat is expensive ->
Every pd.concat call inside the loop copies the entire accumulated DataFrame into a new one, so the total work grows quadratically with the number of rows appended — that is why each batch runs slower than the one before it. What you can do here: start with an empty list and append each row dict to that list. At the very end, after all batches are done, convert the list into a DataFrame in one go. This will be very fast :)
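A minimal sketch of that fix, reusing the mock generator from the question (the `n` parameter is added here so the row count is adjustable; it is not part of the original code):

```python
import pandas as pd

def some_generator(n=1_000_000):
    # Same mock rows as in the question; n added for easier testing.
    for _ in range(n):
        yield {c: 'valA' for c in
               ('colA', 'colB', 'colC', 'colD', 'colE',
                'colF', 'colG', 'colH', 'colI', 'colJ')}

def main(n=1_000_000):
    rows = []                      # plain Python list: append is amortized O(1)
    for row in some_generator(n):
        row.pop('colA')            # same per-row transform as in the question
        rows.append(row)           # no DataFrame work inside the loop
    output = pd.DataFrame(rows)    # build the DataFrame once, at the end
    return output
```

Each iteration now does constant work, so the total cost is linear in the number of rows instead of quadratic, and the per-batch rate stays flat. If memory for the raw dicts of a full download is a concern, the same idea applies per batch: collect one batch's dicts in a list, build one DataFrame per batch, and concatenate the per-batch frames once at the end.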