
Applying the same operation to multiple .csv files in pandas


1 answer

Anonymous, 1 day ago

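You can use a list to hold all of the DataFrames:

<pre><code>import pandas as pd

number_of_files = 6
dfs = []
# Assuming the files are numbered 01 through 06
for file_num in range(1, number_of_files + 1):
    # f-strings need Python 3.6 or later; on older versions use .format()
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))
</code></pre>

Then, to get a specific DataFrame, use: ^{pr2}$

Edit: Since you are trying to avoid loading all of these into memory, I would stream them instead. Try changing the for loop to something like this:

<pre><code>import csv

open_files = []  # keep the handles so they can be closed once you are done
for file_num in range(1, number_of_files + 1):
    # Read mode: 'wb' would truncate the file, and a with-block would close it before use
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", "r", newline="")
    open_files.append(f)
    dfs.append(csv.reader(f))
</code></pre>

Then just use a for loop over <code>dfs[n]</code>, or call <code>next(dfs[n])</code>, to read one row at a time into memory.

P.S. You may need multithreading to iterate over all of them at the same time.

Loading/editing/saving: using the <code>csv</code> module

OK, after a lot of research: Python's <code>csv</code> module does read only one row at a time, because the reader iterates lazily over the file object it is given (explained <a href="https://stackoverflow.com/a/28277372/225020">here</a>).

If you don't want to use <a href="https://pandas.pydata.org/" rel="nofollow noreferrer">Pandas</a> (where chunking may well be the answer; if so, just build it into @seralouk's answer), then yes, in my opinion this is the best approach. We only need to change a few things.

<pre><code>import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)  # "01", "02", ...
    # Open the original file as f for reading only and the new file as nf
    # for writing ('w', so re-running the script does not append duplicates)
    with open(filename.format(month), "r", newline="") as f, \
         open(filename.format(month + "-new"), "w", newline="") as nf:
        # Initialize the writer before looping over every row
        w = csv.writer(nf)
        for row in csv.reader(f):
            # Do your "data cleaning" here (remember, this runs PER ROW)
            # Save to file
            w.writerow(row)
</code></pre>

Note: {a4} I would prefer a writer that is easier to understand.

The Pandas approach

<a href="https://stackoverflow.com/a/29908754/225020">PLEASE READ this answer</a> if you want to steer away from my csv approach and stick with Pandas :) It looks like the same problem as yours, and the answer is what you are asking for.

Basically, Pandas lets you load a file partially, in chunks, apply any changes, and then write those chunks to a new file. The code below is mostly taken from that answer, but I did some more reading in the docs as well.

<pre><code>import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)
    chunks = pd.read_csv(filename.format(month), chunksize=chunksize)
    for i, chunk in enumerate(chunks):
        # Do your data cleaning
        # Write the first chunk with its header, then append the rest
        # without one, so the new file is built up chunk by chunk
        chunk.to_csv(filename.format(month + "-new"),
                     mode="w" if i == 0 else "a", header=(i == 0), index=False)
</code></pre>

For more on chunking data, see <a href="http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking" rel="nofollow noreferrer">here</a>. It is also good reading for anyone who, like you, is getting headaches over these memory issues.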
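For anyone who wants to try the row-by-row <code>csv</code> approach described above without touching the real data, here is a minimal, self-contained sketch. The helper names (`clean_csv_file`, `drop_non_positive_fares`), the `sample_trips-…` file names, and the two-column `trip_id`/`fare` layout are all made up for illustration and are not from the question; swap in your own cleaning rule.

```python
import csv
import os
import tempfile


def clean_csv_file(src_path, dst_path, clean_row):
    """Stream src_path row by row through clean_row() into dst_path.

    clean_row receives a list of strings and returns the cleaned list,
    or None to drop the row. Returns the number of data rows written.
    """
    rows_written = 0
    # newline="" is what the csv module documentation recommends for file objects
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is not None:
            writer.writerow(header)  # copy the header row unchanged
        for row in reader:
            cleaned = clean_row(row)
            if cleaned is not None:
                writer.writerow(cleaned)
                rows_written += 1
    return rows_written


def drop_non_positive_fares(row):
    # Hypothetical rule: the second column is a fare that must be positive.
    return row if float(row[1]) > 0 else None


def demo():
    """Build two tiny sample files, clean both, and return the results."""
    counts = []
    with tempfile.TemporaryDirectory() as tmp:
        for month in ("01", "02"):
            src = os.path.join(tmp, f"sample_trips-{month}.csv")
            dst = os.path.join(tmp, f"sample_trips-{month}-new.csv")
            with open(src, "w", newline="") as f:
                csv.writer(f).writerows(
                    [["trip_id", "fare"], ["1", "5.0"], ["2", "-1.0"], ["3", "7.5"]]
                )
            counts.append(clean_csv_file(src, dst, drop_non_positive_fares))
        # Read back the last cleaned file before the temp directory is removed
        with open(dst, newline="") as f:
            last_output = list(csv.reader(f))
    return counts, last_output


counts, last_output = demo()
```

Only one row per file is held in memory at any time, so the same helper should work on files far larger than RAM; the trade-off is that the cleaning rule can only look at a single row.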
