Efficiently writing millions of rows to a file from a Python DataFrame

import pandas as pd
from multiprocessing import Process

filepath = '/path/to/file.csv'

def df_to_file():
    df = pd.read_csv(filepath)
    f = open('output_file', 'w')
    for i in range(len(df.index)):
        if df['col1'].iloc[i] != '':
            key1 = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]
            key1a = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if df['col2'].iloc[i] != '':
            key1 = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]
            key1a = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if df['col3'].iloc[i] != '':
            key1 = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]
            key1a = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
    f.close()

p = Process(target = df_to_file)
p.start()
p.join()

1 answer

User

#1 · Posted 2024-09-27 04:27:50

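Looping over individual rows with a construct like df['col1'].loc[...] is slow. The iloc- and loc-based selectors are meant for selecting across an entire DataFrame, and they do a lot of index-alignment work that becomes heavy overhead when repeated for every row. Simply iterating over the rows with df.itertuples() instead will be much faster.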

import pandas as pd

def df_to_file():
    df = pd.read_csv(filepath)
    # Text mode is required here: print(..., file=f) raises TypeError on a
    # file opened with 'wb' in Python 3.
    with open('output_file', 'w') as f:
        for row in df.itertuples():
            # Note: NaN (what read_csv produces for empty cells by default) is truthy.
            if row.col1:
                # string1, string2, ... are placeholders for the real key/value expressions
                key1, val1 = string1, string2
                key1a, val1a = string1a, string2a
                # The template only references {0} and {1}, so key1a/val1a are never written.
                print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
            if row.col2:
                key1, val1 = string1, string2
                key1a, val1a = string1a, string2a
                print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
            if row.col3:
                key1, val1 = string1, string2
                key1a, val1a = string1a, string2a
                print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)

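This is probably the bare minimum optimization you can make. If you describe in more detail what you are doing, a vectorized solution might be possible.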
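To give a rough idea of what a vectorized version might look like, here is a minimal sketch. It is not the asker's actual logic: the function name write_set_commands, its parameters, and the sample column names are all hypothetical, and it assumes each key is an id column followed by a shared run of key columns, with a single value column. It also writes the output one id column at a time rather than row by row, so the line order differs from the loop version.

```python
import io
import pandas as pd

def write_set_commands(df, out, id_cols, key_cols, val_col):
    """Write one 'SET <key> <value>' line per non-empty id cell (hypothetical helper)."""
    # Build the part of the key shared by all id columns once, column-wise.
    suffix = df[key_cols[0]].astype(str)
    for col in key_cols[1:]:
        suffix = suffix + df[col].astype(str)
    values = df[val_col].astype(str)

    for col in id_cols:
        ids = df[col].fillna("").astype(str)
        mask = ids != ""  # skip rows where this id column is empty
        lines = "SET " + ids[mask] + suffix[mask] + " " + values[mask]
        if not lines.empty:
            out.write("\n".join(lines) + "\n")

# Tiny made-up frame standing in for the real 6-million-row data.
demo = pd.DataFrame({
    "col1": ["a", ""],
    "col2": ["", "b"],
    "col4": ["x", "y"],
    "val": [1.5, 2.0],
})
buffer = io.StringIO()
write_set_commands(demo, buffer, ["col1", "col2"], ["col4"], "val")
print(buffer.getvalue(), end="")
```

In real use, out would be a file opened in text mode instead of the StringIO buffer. Whether this beats itertuples() on the real data depends on the column types and would need to be measured.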

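Also, don't use the above with multiprocessing. Starting a single Process and joining it right away, as the question does, adds no parallelism.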

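Also, as written, 'SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a) will always produce the same string. If those arguments never change, do the string concatenation once outside the loop and reuse the whole string inside it. Note too that the template only references {0} and {1}, so key1a and val1a never appear in the output; the second SET presumably should use {2} and {3}.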

Edit: it seems you can't do that, but given:

This particular dataset has 6 million rows and 10 columns, mostly comprised of strings with a few float columns. The Redis keys are the strings and the float values are the Redis values in the key-value pair.

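then just use key1 = ''.join((row.col1, row.col4, row.col5, ...)). Don't use str and the + operator; that is very inefficient, and you imply those columns are already strings. If you do have to call str on all of those columns, use map(str, ...).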
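As a small illustration of the join advice, here is a sketch. The helper name make_key and the sample tuple are made up for the example. Note that str.join takes a single iterable, so the parts go in as one tuple (or through map) rather than as separate arguments.

```python
def make_key(parts):
    """Join parts into one key, converting each part with str() via map (hypothetical helper)."""
    return "".join(map(str, parts))

# Made-up stand-in for the fields of one row.
row = ("user", "eu", 42, 3.5)

# All parts already strings: join them directly, no str() calls needed.
key_from_strings = "".join(row[:2])

# Mixed types: map(str, ...) converts lazily while join builds the result.
key_mixed = make_key(row)

print(key_from_strings)
print(key_mixed)
```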

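Finally, if you really need to squeeze out performance, note that row is a namedtuple object, which is a tuple, so you can use integer-based indexing instead of attribute-based label access, i.e. row[1] instead of row.col1 (note that row[0] will be row.Index, i.e. the index). This should be faster, and the difference adds up, because each iteration indexes into the tuple a few dozen times and there are millions of iterations.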
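Here is a short sketch of positional access on the rows from itertuples(). The frame and its column names are invented for the example. The second loop uses itertuples(index=False, name=None), which yields plain tuples without the index, so positions shift down by one; that variant is an addition here, not something the answer above proposes.

```python
import pandas as pd

# Made-up frame: two string columns for the key, one float column for the value.
df = pd.DataFrame({"col1": ["a", "b"], "col4": ["x", "y"], "score": [1.5, 2.0]})

lines_with_index = []
for row in df.itertuples():
    # row[0] is the index (row.Index); data columns start at position 1.
    key = row[1] + row[2]
    lines_with_index.append("SET {0} {1}".format(key, row[3]))

lines_plain = []
for row in df.itertuples(index=False, name=None):
    # Plain tuples with no index field: the first column is at position 0.
    key = row[0] + row[1]
    lines_plain.append("SET {0} {1}".format(key, row[2]))

print(lines_with_index)
print(lines_plain)
```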
