Applying the same operation to multiple .csv files in pandas

3 Answers

User

Answer 1 · edited 2024-06-28 11:14:37

Use glob.glob to get all files with similar names:

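import glob
import pandas as pd

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f, index=False)  # overwrites the original file; index=False avoids adding an index column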

This matches yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv, or even yellow_tripdata_*.csv to match every csv file whose name starts with yellow_tripdata.
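
The difference between ? and * is easy to check without touching the disk, because glob applies the same matching rules as the standard fnmatch module. A small sketch; the file names in the list are made up for illustration:

```python
import fnmatch

# Hypothetical file names, used only to show which ones each pattern selects.
names = [
    "yellow_tripdata_2018-01.csv",
    "yellow_tripdata_2018-02.csv",
    "yellow_tripdata_2018-010.csv",
    "yellow_tripdata_2017-12.csv",
    "green_tripdata_2018-01.csv",
]


def matching(pattern):
    """Return the names that match a glob-style pattern, in sorted order."""
    return sorted(fnmatch.filter(names, pattern))


one_char = matching("yellow_tripdata_2018-0?.csv")  # exactly one character after the "0"
any_tail = matching("yellow_tripdata_2018-0*.csv")  # any run of characters after the "0"
all_yellow = matching("yellow_tripdata_*.csv")      # every yellow_tripdata csv
```

Here one_char leaves out yellow_tripdata_2018-010.csv, while any_tail includes it.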

Note that this also loads only one file at a time.
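
If you would rather not overwrite the originals, the same loop can write each result to a new file. A minimal sketch, assuming pandas is installed; apply_to_files, transform and the "-clean" suffix are names invented for this example:

```python
import glob
import os

import pandas as pd


def apply_to_files(pattern, transform, suffix="-clean"):
    """Read each matching csv, apply transform, and write the result next to it.

    Only one DataFrame is held in memory at a time. The names `transform`
    and `suffix` are hypothetical, chosen for this sketch.
    """
    written = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path)
        df = transform(df)
        root, ext = os.path.splitext(path)
        out_path = root + suffix + ext
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

With a * pattern the output files would match on a second run, so a ? pattern or a separate output directory is the safer choice.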

User

Answer 2 · edited 2024-06-28 11:14:37

You can use a list to hold all the DataFrames:

import pandas as pd

number_of_files = 6
dfs = []

# the files are numbered 1 to 6, so start the range at 1
for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))  # f-strings need Python 3.6+; on older versions use .format()

Then, to get a specific DataFrame, index into the list, for example dfs[n] (zero-based).

Edit:

Since you are trying to avoid loading all of this into memory, I would use streaming instead. Try changing the for loop to something like this:

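import csv

for file_num in range(1, number_of_files + 1):
    # open for reading ('wb' would truncate the file) and keep it open while its reader is in use
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", newline='')
    dfs.append(csv.reader(f))  # rows are read lazily, one at a time; close f when you are done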

Then just use a for loop over dfs[n], or call next(dfs[n]), to read one row at a time into memory.
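
To see what "one row at a time" means in practice, here is a small sketch that uses an in-memory string in place of an open file; the column names and values are made up:

```python
import csv
import io

# In-memory stand-in for an open csv file (hypothetical data).
source = io.StringIO("vendor,fare\n1,10.5\n2,7.0\n")

reader = csv.reader(source)          # nothing has been parsed yet
header = next(reader)                # pulls exactly one row
first_row = next(reader)             # pulls the next one
remaining = [row for row in reader]  # a for loop consumes whatever is left
```

Note that csv.reader yields every field as a string, so numeric columns need converting before you do arithmetic on them.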

P.S.

You may need multithreading to iterate over each of them at the same time.

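Load/Edit/Save: using the csv module

OK, I've done a lot of research: Python's csv module loads only one line at a time, most likely because of the mode we open the file in. (explained here)

If you do want to use pandas, chunking may well be the answer; in that case, just work it into @seralouk's answer. Otherwise, yes! The approach below is, in my view, the best one; we just need to change a few things.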

import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)  # "01" ... "06"
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(month), 'r', newline='') as f, \
         open(filename.format(month + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # save the row to the new file
            w.writerow(row)
Note:

{a4} I would look for a writer that is easier to understand.
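
For per-row cleaning, addressing columns by name can be easier to follow than positional indices. A minimal sketch using the standard csv.DictReader and csv.DictWriter; the column names, the data and the cleaning rule are all hypothetical:

```python
import csv
import io


def clean_rows(src, dst):
    """Stream rows from src to dst one at a time, addressing columns by name."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # Hypothetical cleaning rule: drop rows with a non-positive fare.
        if float(row["fare"]) <= 0:
            continue
        writer.writerow(row)
        kept += 1
    return kept


# In-memory stand-ins for the input and output files.
src = io.StringIO("vendor,fare\n1,10.5\n2,-3.0\n1,7.0\n")
dst = io.StringIO()
kept = clean_rows(src, dst)
```

With real files, pass file objects opened with newline='' in place of the StringIO objects.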

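The pandas way

PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It looks like the same issue as yours, and the answer is what you're asking for.

Basically, pandas lets you partially load a file as chunks, make any changes, and then write those chunks to a new file. The code below is mostly from that answer, but I did do some more reading in the docs.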

import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)  # "01" ... "06"
    for i, chunk in enumerate(pd.read_csv(filename.format(month), chunksize=chunksize)):
        # Do your data cleaning
        # append mode again, so the file is built up chunk by chunk; write the header only with the first chunk
        chunk.to_csv(filename.format(month + "-new"), mode='a', header=(i == 0), index=False)

See here for more information on chunking data; it is also good reading for anyone who, like you, is getting headaches over these memory issues.
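
Two things to watch with append mode: rerunning the script keeps adding to an existing output file, and the header should be written only once. A self-contained sketch of chunked cleaning that handles both, assuming pandas is installed; clean_in_chunks, the fare column and the filter are invented for this example:

```python
import os
import tempfile

import pandas as pd


def clean_in_chunks(src_path, dst_path, clean, chunksize=500):
    """Apply `clean` to src_path chunk by chunk and write the result to dst_path."""
    total = 0
    for i, chunk in enumerate(pd.read_csv(src_path, chunksize=chunksize)):
        cleaned = clean(chunk)
        # The first chunk creates the file and writes the header; later chunks append.
        cleaned.to_csv(dst_path, mode="w" if i == 0 else "a",
                       header=(i == 0), index=False)
        total += len(cleaned)
    return total


# Demo on a small throwaway file (hypothetical column and values).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "trips.csv")
dst = os.path.join(workdir, "trips-new.csv")
pd.DataFrame({"fare": [5.0, -1.0, 8.0, 0.0, 3.5]}).to_csv(src, index=False)

rows_written = clean_in_chunks(src, dst, lambda c: c[c["fare"] > 0], chunksize=2)
result = pd.read_csv(dst)
```

Only one chunk is in memory at any moment, so chunksize is the knob that trades memory use against the number of read and write calls.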

User

Answer 3 · edited 2024-06-28 11:14:37

Use a for loop with format. I use this every day:

import pandas as pd

number_of_files = 6

for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))

    # your code here: do the analysis, then the loop comes back around and reads the next DataFrame
