Python for loop slows down as the number of iterations increases

Posted 2024-09-28 21:22:35


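I'm trying to understand why a loop slows down as the number of iterations grows. This code is only a simulation of some real code that copies data from an API. I have to download the data in batches, because I run out of memory if I download everything at once. However, my loop implementation of the batching is far from ideal. I suspect that using pandas adds overhead, but apart from that, what else could be causing the problem?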

import timeit
import pandas as pd
from tqdm import tqdm


def some_generator():
    for i in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }


def main():
    batch_size = 10_000
    generator = some_generator()
    output = pd.DataFrame()
    batch_round = 1

    while True:

        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):

            try:
                row = next(generator)
                row.pop('colA')
                output = pd.concat([output, pd.DataFrame(row, index=[0])], ignore_index=True)

            except StopIteration:
                break

        if output.shape[0] != batch_size * batch_round:
            break
        else:
            batch_round += 1

    print(output)

This code simulates a DataFrame with 1M rows. When I download the data in batches of 10k, this is the performance I get for the first 20 batches:

Batch 1: 100%|██████████| 10000/10000 [00:21<00:00, 460.89it/s]
Batch 2: 100%|██████████| 10000/10000 [00:28<00:00, 349.16it/s]
Batch 3: 100%|██████████| 10000/10000 [00:38<00:00, 263.12it/s]
Batch 4: 100%|██████████| 10000/10000 [00:43<00:00, 228.76it/s]
Batch 5: 100%|██████████| 10000/10000 [00:53<00:00, 187.44it/s]
Batch 6: 100%|██████████| 10000/10000 [01:02<00:00, 159.92it/s]
Batch 7: 100%|██████████| 10000/10000 [01:09<00:00, 144.79it/s]
Batch 8: 100%|██████████| 10000/10000 [01:18<00:00, 127.59it/s]
Batch 9: 100%|██████████| 10000/10000 [01:25<00:00, 116.92it/s]
Batch 10: 100%|██████████| 10000/10000 [01:34<00:00, 105.96it/s]
Batch 11: 100%|██████████| 10000/10000 [01:40<00:00, 99.81it/s]
Batch 12: 100%|██████████| 10000/10000 [01:46<00:00, 93.92it/s]
Batch 13: 100%|██████████| 10000/10000 [01:55<00:00, 86.49it/s]
Batch 14: 100%|██████████| 10000/10000 [02:03<00:00, 80.92it/s]
Batch 15: 100%|██████████| 10000/10000 [02:10<00:00, 76.46it/s]
Batch 16: 100%|██████████| 10000/10000 [02:18<00:00, 71.99it/s]
Batch 17: 100%|██████████| 10000/10000 [02:25<00:00, 68.69it/s]
Batch 18: 100%|██████████| 10000/10000 [02:32<00:00, 65.57it/s]
Batch 19: 100%|██████████| 10000/10000 [02:42<00:00, 61.53it/s]
Batch 20: 100%|██████████| 10000/10000 [02:39<00:00, 62.84it/s]

1 answer
User
Answer #1 · Posted 2024-09-28 21:22:35

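pd.concat is expensive. Each call builds a new DataFrame and copies every row accumulated so far, so the cost of adding one row grows with the size of output.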
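
A rough way to see where the slowdown comes from is to count how many rows get copied. The sketch below assumes a simplified cost model (my assumption, not a measurement): concatenating one row onto a frame that already holds k rows copies those k rows. The function name rows_copied_in_batch is made up for illustration.

```python
def rows_copied_in_batch(batch_round: int, batch_size: int = 10_000) -> int:
    """Rows copied by row-by-row concatenation during one batch.

    Simplified model (an assumption): appending to a frame that already
    holds k rows copies those k rows into the new frame.
    """
    # Rows already accumulated when this batch begins.
    start = (batch_round - 1) * batch_size
    # Sum of k for k = start .. start + batch_size - 1.
    return batch_size * start + batch_size * (batch_size - 1) // 2


for batch_round in (1, 2, 20):
    copied = rows_copied_in_batch(batch_round)
    print(f"Batch {batch_round}: {copied:,} rows copied")
```

Under this model batch 1 copies 49,995,000 rows and batch 20 copies 1,949,995,000, about 39 times as many. The slowdown measured in the question is smaller than that (about 21 s for batch 1 versus 2 min 39 s for batch 20), which would be consistent with a fixed per-row cost, such as building each one-row DataFrame, on top of the copying.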

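Here is what you can do instead: start with an empty list and append each row dict to it. At the end, after all the operations, convert the list to a DataFrame once. This is very fast :)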

import timeit
import pandas as pd
from tqdm import tqdm


def some_generator():
    for _ in range(1_000_000):
        yield {
            'colA': 'valA',
            'colB': 'valA',
            'colC': 'valA',
            'colD': 'valA',
            'colE': 'valA',
            'colF': 'valA',
            'colG': 'valA',
            'colH': 'valA',
            'colI': 'valA',
            'colJ': 'valA'
        }


def main():
    batch_size = 10_000
    generator = some_generator()
    output = []
    batch_round = 1

    while True:

        for _ in tqdm(range(batch_size), desc=f"Batch {batch_round}"):

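            try:
                row = next(generator)
                row.pop('colA')
                # Appending to a list does not copy the rows collected so far.
                output.append(row)

            except StopIteration:
                break

        # A short batch means the generator is exhausted, so stop.
        if len(output) != batch_size * batch_round:
            break
        else:
            batch_round += 1

    # Build the DataFrame once, after all rows have been collected.
    # print(pd.DataFrame(output))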

main()

Output:

Batch 1: 100%|██████████| 10000/10000 [00:00<00:00, 826724.48it/s]
Batch 2: 100%|██████████| 10000/10000 [00:00<00:00, 978765.55it/s]
Batch 3: 100%|██████████| 10000/10000 [00:00<00:00, 1072629.72it/s]
Batch 4: 100%|██████████| 10000/10000 [00:00<00:00, 1267237.90it/s]
Batch 5: 100%|██████████| 10000/10000 [00:00<00:00, 1351301.27it/s]
Batch 6: 100%|██████████| 10000/10000 [00:00<00:00, 1402918.02it/s]
Batch 7: 100%|██████████| 10000/10000 [00:00<00:00, 1374370.54it/s]
Batch 8: 100%|██████████| 10000/10000 [00:00<00:00, 1435520.57it/s]
Batch 9: 100%|██████████| 10000/10000 [00:00<00:00, 1499947.79it/s]
Batch 10: 100%|██████████| 10000/10000 [00:00<00:00, 1458381.08it/s]
Batch 11: 100%|██████████| 10000/10000 [00:00<00:00, 1366178.30it/s]
Batch 12: 100%|██████████| 10000/10000 [00:00<00:00, 1396844.17it/s]
Batch 13: 100%|██████████| 10000/10000 [00:00<00:00, 1376309.76it/s]
Batch 14: 100%|██████████| 10000/10000 [00:00<00:00, 1453881.94it/s]
Batch 15: 100%|██████████| 10000/10000 [00:00<00:00, 1373245.59it/s]
Batch 16: 100%|██████████| 10000/10000 [00:00<00:00, 1470756.72it/s]
Batch 17: 100%|██████████| 10000/10000 [00:00<00:00, 1450964.82it/s]
Batch 18: 100%|██████████| 10000/10000 [00:00<00:00, 1495882.16it/s]
Batch 19: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 20: 100%|██████████| 10000/10000 [00:00<00:00, 1479733.29it/s]
Batch 21: 100%|██████████| 10000/10000 [00:00<00:00, 1383528.17it/s]
Batch 22: 100%|██████████| 10000/10000 [00:00<00:00, 1361521.78it/s]
Batch 23: 100%|██████████| 10000/10000 [00:00<00:00, 1420594.07it/s]
Batch 24: 100%|██████████| 10000/10000 [00:00<00:00, 1468850.99it/s]
Batch 25: 100%|██████████| 10000/10000 [00:00<00:00, 1477960.46it/s]
Batch 26: 100%|██████████| 10000/10000 [00:00<00:00, 1055755.13it/s]
Batch 27: 100%|██████████| 10000/10000 [00:00<00:00, 952104.06it/s]
Batch 28: 100%|██████████| 10000/10000 [00:00<00:00, 1260231.96it/s]
Batch 29: 100%|██████████| 10000/10000 [00:00<00:00, 1433705.01it/s]
Batch 30: 100%|██████████| 10000/10000 [00:00<00:00, 1404703.44it/s]
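
The batching loop itself can also be written without the manual next() / StopIteration handling. The following is only a sketch under my own assumptions: in_batches and some_rows are hypothetical helpers, and some_rows is a small stand-in for the API generator in the question. It uses itertools.islice to pull at most batch_size rows at a time.

```python
from itertools import islice


def in_batches(iterable, batch_size):
    """Yield lists of up to batch_size items until the iterable is exhausted."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def some_rows(n):
    """Hypothetical stand-in for the API generator in the question."""
    for i in range(n):
        yield {'colA': 'valA', 'colB': i}


output = []
for batch_round, batch in enumerate(in_batches(some_rows(25), 10), start=1):
    for row in batch:
        row.pop('colA')
    # extend() adds the new rows without copying the existing ones.
    output.extend(batch)
    print(f"Batch {batch_round}: {len(batch)} rows")

print(len(output))
```

With 25 rows and a batch size of 10 this yields batches of 10, 10 and 5 rows. As in the answer above, the list would be passed to pd.DataFrame once at the end rather than concatenated row by row.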
