How can I optimize iterating over a DataFrame in pandas?

import pandas as pd

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # Report progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(str(index)))
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids

2 answers

User

#1 · Edited 2024-06-25 23:52:35

Use apply instead:

def func(x):
    # With axis=1, x is one row as a Series; x.name is that row's index label
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(str(x.name)))

    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)
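To try this end to end, here is a minimal sketch. The question never shows `get_id`, so the version below is a hypothetical stand-in, and the three-row DataFrame is made-up sample data.

```python
import pandas as pd


def get_id(name, sex):
    # Hypothetical stand-in: the real get_id is not shown in the question
    return "{}-{}".format(name, sex)


def func(x):
    # apply(..., axis=1) passes each row as a Series
    return get_id(x['name'], x['sex'])


# Made-up sample data in place of the tab-separated file
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'],
                   'sex': ['F', 'M', 'F']})
df['id'] = df.apply(func, axis=1)
```

The row loop still runs in Python here, so treat this mainly as a tidier way to write the loop, and time it on your own data before relying on it for speed.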

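User

#2 · Edited 2024-06-25 23:52:35

If you are working with a large dataset, np.vectorize() can help you avoid part of the apply() overhead, so it should be somewhat faster. Note that np.vectorize() still calls the Python function once per element; the saving comes from not building a Series for every row, not from true vectorization.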

import numpy as np

# Pass the two columns separately: the lambda is called once per (name, sex) pair
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])

Edit:

For a bit more speed, you can also pass get_id directly instead of wrapping it in a lambda, and pass df.*.values instead of df.*:

v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
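Since np.vectorize is essentially a Python-level loop, a plain list comprehension over the zipped columns is a comparable alternative that is worth timing against it. The sketch below assumes a hypothetical `get_id` (the real one is not shown) and made-up sample data. Passing `otypes=[object]` is an optional addition that tells NumPy the output type up front, so it does not make an extra call on the first element to infer it.

```python
import numpy as np
import pandas as pd


def get_id(name, sex):
    # Hypothetical stand-in: the real get_id is not shown in the question
    return "{}-{}".format(name, sex)


# Made-up sample data
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'],
                   'sex': ['F', 'M', 'F']})

# Option 1: np.vectorize with the output type declared explicitly
v = np.vectorize(get_id, otypes=[object])
ids_vectorize = v(df['name'].values, df['sex'].values)

# Option 2: list comprehension over the zipped column arrays
ids_listcomp = [get_id(n, s)
                for n, s in zip(df['name'].values, df['sex'].values)]

df['id'] = ids_listcomp
```

Both options produce the same ids; which one is faster depends on how expensive `get_id` itself is, so measure with your real function.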

Consider using tqdm to show a progress bar instead of printing progress updates throughout the run:

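import numpy as np
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    global pbar
    ...
    pbar.update(1)  # advance the bar by one for each call
    ...
    return


# Without otypes, np.vectorize calls get_id one extra time on the first
# element to infer the output type, so the bar can end one tick over the total.
global pbar
with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)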
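If you would rather not rely on a global `pbar`, one alternative is to wrap the iterable itself in tqdm, which then advances the bar as items are consumed. This is a sketch of a different technique from the decorator version above, again with a hypothetical `get_id` and made-up sample data.

```python
import pandas as pd
from tqdm import tqdm


def get_id(name, sex):
    # Hypothetical stand-in: the real get_id body is elided in the answer
    return "{}-{}".format(name, sex)


# Made-up sample data
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'],
                   'sex': ['F', 'M', 'F']})

# zip has no length, so total= tells tqdm how many items to expect
pairs = zip(df['name'].values, df['sex'].values)
df['id'] = [get_id(name, sex) for name, sex in tqdm(pairs, total=len(df))]
```

This keeps `get_id` free of any progress-reporting code, so the same function can be reused where no bar is wanted.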
