Answers to: Reading multiple CSVs with different headers into one DataFrame in Python



1 answer

Anonymous, 1 day ago


Assuming you have the following CSV files:

test1.csv:

<pre><code>year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
</code></pre>

test2.csv:

^{pr2}$

test3.csv:

<pre><code>year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
</code></pre>

Solution:

<pre><code>import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    # Read the first file, then outer-merge every remaining file into it.
    # With no `on` argument, merge() joins on the columns the frames share.
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))
</code></pre>

Output:

<pre><code>   year  month  day  Direct  Direct  Direct2            File3
0  1992      1    1    11.0    21.0    201.0            text1
1  2013      5   30    11.0    21.0    202.0            text2
2  2004      9    1    11.0    21.0    203.0            text3
3  2016      1    1     NaN     NaN      NaN  unmatching_date
</code></pre>

UPDATE: the usual idiomatic <code>pd.concat(list_of_dfs)</code> solution does not work here, because it aligns on the index instead of the shared key columns:

<pre><code>In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
       0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date
</code></pre>

Passing <code>index_col=None</code> explicitly gives the same results:

<pre><code>In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
   Direct  Direct  Direct2            File3  day  month  year
0     NaN    11.0      NaN              NaN    1      1  1992
1     NaN    11.0      NaN              NaN   30      5  2013
2     NaN    11.0      NaN              NaN    1      9  2004
3    21.0     NaN    201.0              NaN    1      1  1992
4    21.0     NaN    202.0              NaN   30      5  2013
5    21.0     NaN    203.0              NaN    1      9  2004
6     NaN     NaN      NaN            text1    1      1  1992
7     NaN     NaN      NaN            text2   30      5  2013
8     NaN     NaN      NaN            text3    1      9  2004
9     NaN     NaN      NaN  unmatching_date    1      1  2016

In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
       0    1     2     3       4    5     6     7      8     9  10  11               12
0  1992.0  1.0   1.0  11.0  1992.0  1.0   1.0  21.0  201.0  1992   1   1            text1
1  2013.0  5.0  30.0  11.0  2013.0  5.0  30.0  21.0  202.0  2013   5  30            text2
2  2004.0  9.0   1.0  11.0  2004.0  9.0   1.0  21.0  203.0  2004   9   1            text3
3     NaN  NaN   NaN   NaN     NaN  NaN   NaN   NaN    NaN  2016   1   1  unmatching_date
</code></pre>

The following more idiomatic solution works, but it changes the original order of the columns and of the rows/data:

<pre><code>In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
     ...:
     ...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
     ...:
     ...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
     ...:
Out[224]:
   month  day  year  Direct  Direct  Direct2            File3
0      1    1  1992    11.0    21.0    201.0            text1
1      1    1  2016     NaN     NaN      NaN  unmatching_date
2      5   30  2013    11.0    21.0    202.0            text2
3      9    1  2004    11.0    21.0    203.0            text3
</code></pre>
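For anyone who wants to try the merge-based approach without creating files under d:/temp, here is a minimal sketch that holds the contents of test1.csv and test3.csv in memory and names the key columns explicitly. The helper name <code>merge_on_keys</code> and the <code>KEYS</code> list are assumptions made for illustration, not part of the original answer; passing <code>on=</code> is just one way to keep the join from silently picking up a non-key column that two files happen to share.

```python
import io
from functools import reduce

import pandas as pd

# Contents of test1.csv and test3.csv from the answer, kept in memory
# so the sketch runs without touching the file system.
TEST1 = """year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
"""
TEST3 = """year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
"""

# Assumption: every file shares exactly these key columns.
KEYS = ["year", "month", "day"]


def merge_on_keys(frames, keys):
    """Outer-merge the frames pairwise on explicit key columns (hypothetical helper)."""
    return reduce(
        lambda left, right: left.merge(right, on=keys, how="outer"),
        frames,
    )


frames = [pd.read_csv(io.StringIO(text)) for text in (TEST1, TEST3)]
merged = merge_on_keys(frames, KEYS)
print(merged)
```

The result should have one row per distinct date, with NaN in <code>Direct</code> for the date that appears only in test3.csv. The row order of an outer merge may differ between pandas versions, so sort on the key columns afterwards if the order matters.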
