I am trying to read and write data with Python from a large file of roughly 300 million lines and about 200 GB. I have the basic code working, but I would like to parallelize it so it runs faster. To do so, I have been following this guide: https://www.blopig.com/blog/2016/08/processing-large-files-using-python/. However, when I try to parallelize the code I get the error: "TypeError: worker() argument after * must be an iterable, not int". How can I get the code to run? Do you have any suggestions for improving efficiency? Note that I am still fairly new to Python.
Basic code (where id_pct1 and id_pct001 are already defined):
with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
    for line in f:
        data = line.split('*')
        if data[30] in id_pct1: out_f1.write(line)
        if data[30] in id_pct001: out_f001.write(line)
Parallel code:
def worker(lineByte):
    with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
        f.seek(lineByte)
        line = f.readline()
        data = line.split('*')
        if data[30] in id_pct1: out_f1.write(line)
        if data[30] in id_pct001: out_f001.write(line)
def main():
    pool = mp.Pool()
    jobs = []
    with open('Subsets/FirstLines.txt') as f:
        nextLineByte = 0
        for line in f:
            jobs.append(pool.apply_async(worker, (nextLineByte)))
            nextLineByte += len(line)
    for job in jobs:
        job.get()
    pool.close()

if __name__ == '__main__':
    main()
Answer:

pool.apply_async() expects its args argument to be an iterable, typically a tuple. (nextLineByte) is not a one-element tuple; the parentheses are just grouping, so it evaluates to the plain int nextLineByte, which is what raises the error. Write it as (nextLineByte,) with a trailing comma instead.
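The fix can be sketched with a minimal, self-contained example. The real file handling is replaced here with a hypothetical worker that simply echoes the byte offset it receives; only the tuple-argument fix is the point:

```python
import multiprocessing as mp

def worker(line_byte):
    # Hypothetical stand-in for the real worker: just return the offset
    # so the fix to the apply_async call can be seen in isolation.
    return line_byte

def main():
    pool = mp.Pool()
    jobs = []
    for offset in (0, 10, 25):
        # args must be an iterable: (offset,) with the trailing comma
        # is a one-element tuple, whereas (offset) is just the int itself
        # and triggers "argument after * must be an iterable, not int".
        jobs.append(pool.apply_async(worker, (offset,)))
    results = [job.get() for job in jobs]
    pool.close()
    pool.join()
    return results

if __name__ == '__main__':
    print(main())
```

The trailing comma is what makes the difference: Python treats `(x)` as the value `x` in parentheses, while `(x,)` is a tuple containing `x`, which apply_async can then unpack as the worker's arguments.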