Concatenating a large number of CSV files (30,000) in Python Pandas

Published 2024-10-02 22:23:36


I use the following function to concatenate a large number of CSV files:

import pandas as pd

def concatenate():
    files = sort()  # input is an array of filenames
    merged = pd.DataFrame()
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'):  # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:, 'Timelimit'] = "1"
            else:
                f.loc[:, 'Timelimit'] = "0"
            merged = pd.concat([merged, f], axis=0)
    merged.to_csv('merged.csv')

The problem with this function is that it does not handle a large number of files (30,000) well. I tested it on a sample of 100 files and it worked, but with 30,000 files the script slows down and eventually crashes.

How can I handle a large number of files more efficiently in Python Pandas?


1 Answer

#1 · Posted 2024-10-02 22:23:36

Build a list of DataFrames first, then concatenate them in a single call:

import pandas as pd

def concatenate():
    files = sort()  # input is an array of filenames
    df_list = []
    for file in files:
        print("concatenating " + file)
        if file.endswith('FulltimeSimpleOpt.csv'):  # only consider those filenames
            filenamearray = file.split("_")
            f = pd.read_csv(file, index_col=0)
            f.loc[:, 'Vehicle'] = filenamearray[0].replace("veh", "")
            f.loc[:, 'Year'] = filenamearray[1].replace("year", "")
            if "timelimit" in file:
                f.loc[:, 'Timelimit'] = "1"
            else:
                f.loc[:, 'Timelimit'] = "0"
            df_list.append(f)
    merged = pd.concat(df_list, axis=0)
    merged.to_csv('merged.csv')

What you were doing was growing your DataFrame incrementally through repeated concatenation: each `pd.concat` inside the loop copies every row accumulated so far, so the total cost grows quadratically with the number of files. Collecting the DataFrames in a list and concatenating them all at once at the end avoids that.
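To illustrate the pattern in isolation, here is a minimal self-contained sketch of list-then-concat (the column names and data here are made up for demonstration, not taken from the question's files):

```python
import pandas as pd

def concat_all(frames):
    # One concat over the whole list: each input frame is copied exactly once,
    # unlike repeated pd.concat in a loop, which recopies the accumulated rows.
    return pd.concat(frames, axis=0, ignore_index=True)

# Simulate three small per-file DataFrames (illustrative data).
parts = [pd.DataFrame({"Vehicle": [str(i)], "Year": [str(2020 + i)]})
         for i in range(3)]

merged = concat_all(parts)
print(merged.shape)  # (3, 2)
```

The same shape of code scales to 30,000 frames, since the work done per frame is constant rather than proportional to everything read so far.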
