python,将数据列表转换为datafram

2024-10-16 22:34:12 发布

您现在位置:Python中文网/ 问答频道 /正文

我得到了一个txt文件的数据部分,并将其存储在一个列表中。数据应该是年份、数据1、数据2、数据3。它们在原始txt文件中用\t\t或\t分隔,因为我直接附加了数据行。现在我想把它放到一个数据框中去处理。dataframe有三列year、data1和data2。你知道吗

['2018\t  \t7,107\t4,394\t2,713', '2017\t \t16,478\t10,286\t6,192', '2016\t  \t15,944\t9,971\t5,973', '2015\t \t15,071\t9,079\t5,992', '2014\t  \t14,415\t8,596\t5,819', '2013\t \t14,259\t8,269\t5,990', '2012\t  \t14,010\t8,143\t5,867', '2011\t \t14,149\t8,126\t6,023', '2010\t  \t14,505\t7,943\t6,562', '2009\t \t14,632\t8,022\t6,610', '2008\t  \t14,207\t7,989\t6,218', '2007\t \t14,400\t8,085\t6,315', '2006\t  \t14,750\t8,017\t6,733', '2005\t \t14,497\t7,593\t6,904', '2004\t  \t14,155\t7,150\t7,005', '2003\t \t13,285\t6,457\t6,828', '2002\t  \t12,821\t6,190\t6,631', '2001\t \t12,702\t6,080\t6,622', '2000\t  \t11,942\t5,985\t5,957', '1999\t \t10,872\t5,824\t5,048', '2018\t   \t10,362\t5,793\t4,569', '2017\t \t9,546\t5,479\t4,067', '2016\t  \t9,222\t5,418\t3,804', '2015\t \t8,859\t5,363\t3,496', '2014\t  \t8,203\t5,099\t3,104', '2013\t \t7,766\t4,861\t2,905', '2012\t  \t7,091\t4,520\t2,571', '2011\t \t6,953\t4,526\t2,427', '2010\t  \t6,632\t4,509\t2,123', '2009\t \t5,929\t4,011\t1,918', '2008\t  \t5,909\t4,080\t1,829']

我想要一个列名为year,data1,data2,data3的dataframe

谢谢。你知道吗


Tags: 文件数据txtdataframeyeart3t2t4
2条回答

另一种方法是不使用regex(但不如Romans answer整洁),使用列表理解清理数据,然后放入dict中,从中创建数据帧:

data =  ['2018\t  \t7,107\t4,394\t2,713',              '2017\t \t16,478\t10,286\t6,192', 
         '2016\t  \t15,944\t9,971\t5,973',             '2015\t \t15,071\t9,079\t5,992', 
         '2014\t  \t14,415\t8,596\t5,819',             '2013\t \t14,259\t8,269\t5,990', 
         '2012\t  \t14,010\t8,143\t5,867',             '2011\t \t14,149\t8,126\t6,023', 
         '2010\t  \t14,505\t7,943\t6,562',             '2009\t \t14,632\t8,022\t6,610', 
         '2008\t  \t14,207\t7,989\t6,218',             '2007\t \t14,400\t8,085\t6,315', 
         '2006\t  \t14,750\t8,017\t6,733',             '2005\t \t14,497\t7,593\t6,904', 
         '2004\t  \t14,155\t7,150\t7,005',             '2003\t \t13,285\t6,457\t6,828', 
         '2002\t  \t12,821\t6,190\t6,631',             '2001\t \t12,702\t6,080\t6,622', 
         '2000\t  \t11,942\t5,985\t5,957',             '1999\t \t10,872\t5,824\t5,048', 
         '1998\t   \t10,362\t5,793\t4,569',            '1997\t \t9,546\t5,479\t4,067', 
         '1996\t  \t9,222\t5,418\t3,804',              '1995\t \t8,859\t5,363\t3,496', 
         '1994\t  \t8,203\t5,099\t3,104',              '1993\t \t7,766\t4,861\t2,905', 
         '1992\t  \t7,091\t4,520\t2,571',              '1991\t \t6,953\t4,526\t2,427', 
         '1990\t  \t6,632\t4,509\t2,123',              '1989\t \t5,929\t4,011\t1,918', 
         '1988\t  \t5,909\t4,080\t1,829']

# partition and clean the data
cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data  ]
# make a dict
dataCleaned = {x:y for x,*y in cleaned}

print (dataCleaned)

import pandas as pd
df = pd.DataFrame(dataCleaned)

print(df)

输出:

# the dict 
{'2018': ['7,107', '4,394', '2,713'], '2017': ['16,478', '10,286', '6,192'], 
 '2016': ['15,944', '9,971', '5,973'], '2015': ['15,071', '9,079', '5,992'], 
 '2014': ['14,415', '8,596', '5,819'], '2013': ['14,259', '8,269', '5,990'], 
 '2012': ['14,010', '8,143', '5,867'], '2011': ['14,149', '8,126', '6,023'], 
 '2010': ['14,505', '7,943', '6,562'], '2009': ['14,632', '8,022', '6,610'], 
 '2008': ['14,207', '7,989', '6,218'], '2007': ['14,400', '8,085', '6,315'], 
 '2006': ['14,750', '8,017', '6,733'], '2005': ['14,497', '7,593', '6,904'], 
 '2004': ['14,155', '7,150', '7,005'], '2003': ['13,285', '6,457', '6,828'], 
 '2002': ['12,821', '6,190', '6,631'], '2001': ['12,702', '6,080', '6,622'], 
 '2000': ['11,942', '5,985', '5,957'], '1999': ['10,872', '5,824', '5,048'], 
 '1998': ['10,362', '5,793', '4,569'], '1997': ['9,546', '5,479', '4,067'], 
 '1996': ['9,222', '5,418', '3,804'], '1995': ['8,859', '5,363', '3,496'], 
 '1994': ['8,203', '5,099', '3,104'], '1993': ['7,766', '4,861', '2,905'], 
 '1992': ['7,091', '4,520', '2,571'], '1991': ['6,953', '4,526', '2,427'], 
 '1990': ['6,632', '4,509', '2,123'], '1989': ['5,929', '4,011', '1,918'], 
 '1988': ['5,909', '4,080', '1,829']

}

# the dataframe
    1988   1989   1990   1991   1992   1993   1994   1995   1996   1997  \
0  5,909  5,929  6,632  6,953  7,091  7,766  8,203  8,859  9,222  9,546
1  4,080  4,011  4,509  4,526  4,520  4,861  5,099  5,363  5,418  5,479
2  1,829  1,918  2,123  2,427  2,571  2,905  3,104  3,496  3,804  4,067

   ...      2009    2010    2011    2012    2013    2014    2015    2016  \
0  ...    14,632  14,505  14,149  14,010  14,259  14,415  15,071  15,944
1  ...     8,022   7,943   8,126   8,143   8,269   8,596   9,079   9,971
2  ...     6,610   6,562   6,023   5,867   5,990   5,819   5,992   5,973

     2017   2018
0  16,478  7,107
1  10,286  4,394
2   6,192  2,713

[3 rows x 31 columns] 

编辑后:

import pandas as pd

data = ['2018\t  \t7,107\t4,394\t2,713', '2017\t \t16,478\t10,286\t6,192', 
        '2016\t  \t15,944\t9,971\t5,973', '2015\t \t15,071\t9,079\t5,992', 
        '2014\t  \t14,415\t8,596\t5,819', '2013\t \t14,259\t8,269\t5,990', 
        '2012\t  \t14,010\t8,143\t5,867', '2011\t \t14,149\t8,126\t6,023', 
        '2010\t  \t14,505\t7,943\t6,562', '2009\t \t14,632\t8,022\t6,610', 
        '2008\t  \t14,207\t7,989\t6,218', '2007\t \t14,400\t8,085\t6,315', 
        '2006\t  \t14,750\t8,017\t6,733', '2005\t \t14,497\t7,593\t6,904', 
        '2004\t  \t14,155\t7,150\t7,005', '2003\t \t13,285\t6,457\t6,828', 
        '2002\t  \t12,821\t6,190\t6,631', '2001\t \t12,702\t6,080\t6,622', 
        '2000\t  \t11,942\t5,985\t5,957', '1999\t \t10,872\t5,824\t5,048', 
        '2018\t   \t10,362\t5,793\t4,569', '2017\t \t9,546\t5,479\t4,067', 
        '2016\t  \t9,222\t5,418\t3,804', '2015\t \t8,859\t5,363\t3,496', 
        '2014\t  \t8,203\t5,099\t3,104', '2013\t \t7,766\t4,861\t2,905', 
        '2012\t  \t7,091\t4,520\t2,571', '2011\t \t6,953\t4,526\t2,427', 
        '2010\t  \t6,632\t4,509\t2,123', '2009\t \t5,929\t4,011\t1,918', 
        '2008\t  \t5,909\t4,080\t1,829']

# partition and clean the data
cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data  ]

import pandas as pd
df = pd.DataFrame(cleaned,columns=['year', 'data1', 'data2', 'data3'])

print(df)

编辑后输出:

    year   data1   data2  data3
0   2018   7,107   4,394  2,713
1   2017  16,478  10,286  6,192
2   2016  15,944   9,971  5,973
3   2015  15,071   9,079  5,992
4   2014  14,415   8,596  5,819
5   2013  14,259   8,269  5,990
6   2012  14,010   8,143  5,867
7   2011  14,149   8,126  6,023
8   2010  14,505   7,943  6,562
9   2009  14,632   8,022  6,610
10  2008  14,207   7,989  6,218
11  2007  14,400   8,085  6,315
12  2006  14,750   8,017  6,733
13  2005  14,497   7,593  6,904
14  2004  14,155   7,150  7,005
15  2003  13,285   6,457  6,828
16  2002  12,821   6,190  6,631
17  2001  12,702   6,080  6,622
18  2000  11,942   5,985  5,957
19  1999  10,872   5,824  5,048
20  2018  10,362   5,793  4,569
21  2017   9,546   5,479  4,067
22  2016   9,222   5,418  3,804
23  2015   8,859   5,363  3,496
24  2014   8,203   5,099  3,104
25  2013   7,766   4,861  2,905
26  2012   7,091   4,520  2,571
27  2011   6,953   4,526  2,427
28  2010   6,632   4,509  2,123
29  2009   5,929   4,011  1,918
30  2008   5,909   4,080  1,829 

编辑:

cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data  ]

与以下内容大致相同:

alsoCleaned = []
for year in data:
    part = []    # collect all parts of one string
    for x in year.split("\t"):  # split the one string
        partCleaned = x.strip()   # remove whitespaces from x
        if partCleaned :          # only if now got content
            part.append(partCleaned) # add to part
    alsoCleaned.append(part)    # done all parts  so add to big list
    part = []

print(alsoCleaned)

==>

[['2018', '7,107', '4,394', '2,713'], ['2017', '16,478', '10,286', '6,192'], 
 # .... and so on ...., 
 ['2008', '5,909', '4,080', '1,829']]

通过re模块和生成器表达式:

假设我们每年都有数据。你知道吗

In [60]: import re

In [61]: lst = ['2018\t  \t7,107\t4,394\t2,713', '2017\t \t16,478\t10,286\t6,192', '2016\t  \t15,944\t9,971\t5,973', '2015\t \t15,071\t9,079\t5,992', '
    ...: 2014\t  \t14,415\t8,596\t5,819', '2013\t \t14,259\t8,269\t5,990', '2012\t  \t14,010\t8,143\t5,867', '2011\t \t14,149\t8,126\t6,023', '2010\t  
    ...: \t14,505\t7,943\t6,562', '2009\t \t14,632\t8,022\t6,610', '2008\t  \t14,207\t7,989\t6,218', '2007\t \t14,400\t8,085\t6,315', '2006\t  \t14,750
    ...: \t8,017\t6,733', '2005\t \t14,497\t7,593\t6,904', '2004\t  \t14,155\t7,150\t7,005', '2003\t \t13,285\t6,457\t6,828', '2002\t  \t12,821\t6,190\
    ...: t6,631', '2001\t \t12,702\t6,080\t6,622', '2000\t  \t11,942\t5,985\t5,957', '1999\t \t10,872\t5,824\t5,048', '1998\t   \t10,362\t5,793\t4,569'
    ...: , '1997\t \t9,546\t5,479\t4,067', '1996\t  \t9,222\t5,418\t3,804', '1995\t \t8,859\t5,363\t3,496', '1994\t  \t8,203\t5,099\t3,104', '1993\t \t
    ...: 7,766\t4,861\t2,905', '1992\t  \t7,091\t4,520\t2,571', '1991\t \t6,953\t4,526\t2,427', '1990\t  \t6,632\t4,509\t2,123', '1989\t \t5,929\t4,011
    ...: \t1,918', '1988\t  \t5,909\t4,080\t1,829']

In [62]: pat = re.compile(r'[^\s]+')

In [63]: parsed = (pat.findall(i) for i in lst)

In [64]: df = pd.DataFrame({i[0] : i[1:] for i in parsed})

In [65]: df
Out[65]: 
    1988   1989   1990   1991   1992   1993   1994   1995   1996  ...      2010    2011    2012    2013    2014    2015    2016    2017   2018
0  5,909  5,929  6,632  6,953  7,091  7,766  8,203  8,859  9,222  ...    14,505  14,149  14,010  14,259  14,415  15,071  15,944  16,478  7,107
1  4,080  4,011  4,509  4,526  4,520  4,861  5,099  5,363  5,418  ...     7,943   8,126   8,143   8,269   8,596   9,079   9,971  10,286  4,394
2  1,829  1,918  2,123  2,427  2,571  2,905  3,104  3,496  3,804  ...     6,562   6,023   5,867   5,990   5,819   5,992   5,973   6,192  2,713

[3 rows x 31 columns]

相关问题 更多 >