<p>另一种方法是不使用regex(但不如<a href="https://stackoverflow.com/a/51221288/7505395">Romans answer</a>整洁),使用列表理解清理数据,然后放入dict中,从中创建数据帧:</p>
<pre><code>data = ['2018\t \t7,107\t4,394\t2,713', '2017\t \t16,478\t10,286\t6,192',
'2016\t \t15,944\t9,971\t5,973', '2015\t \t15,071\t9,079\t5,992',
'2014\t \t14,415\t8,596\t5,819', '2013\t \t14,259\t8,269\t5,990',
'2012\t \t14,010\t8,143\t5,867', '2011\t \t14,149\t8,126\t6,023',
'2010\t \t14,505\t7,943\t6,562', '2009\t \t14,632\t8,022\t6,610',
'2008\t \t14,207\t7,989\t6,218', '2007\t \t14,400\t8,085\t6,315',
'2006\t \t14,750\t8,017\t6,733', '2005\t \t14,497\t7,593\t6,904',
'2004\t \t14,155\t7,150\t7,005', '2003\t \t13,285\t6,457\t6,828',
'2002\t \t12,821\t6,190\t6,631', '2001\t \t12,702\t6,080\t6,622',
'2000\t \t11,942\t5,985\t5,957', '1999\t \t10,872\t5,824\t5,048',
'1998\t \t10,362\t5,793\t4,569', '1997\t \t9,546\t5,479\t4,067',
'1996\t \t9,222\t5,418\t3,804', '1995\t \t8,859\t5,363\t3,496',
'1994\t \t8,203\t5,099\t3,104', '1993\t \t7,766\t4,861\t2,905',
'1992\t \t7,091\t4,520\t2,571', '1991\t \t6,953\t4,526\t2,427',
'1990\t \t6,632\t4,509\t2,123', '1989\t \t5,929\t4,011\t1,918',
'1988\t \t5,909\t4,080\t1,829']
# partition and clean the data
cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data ]
# make a dict
dataCleaned = {x:y for x,*y in cleaned}
print (dataCleaned)
import pandas as pd
df = pd.DataFrame(dataCleaned)
print(df)
</code></pre>
<p>输出:</p>
<pre><code># the dict
{'2018': ['7,107', '4,394', '2,713'], '2017': ['16,478', '10,286', '6,192'],
'2016': ['15,944', '9,971', '5,973'], '2015': ['15,071', '9,079', '5,992'],
'2014': ['14,415', '8,596', '5,819'], '2013': ['14,259', '8,269', '5,990'],
'2012': ['14,010', '8,143', '5,867'], '2011': ['14,149', '8,126', '6,023'],
'2010': ['14,505', '7,943', '6,562'], '2009': ['14,632', '8,022', '6,610'],
'2008': ['14,207', '7,989', '6,218'], '2007': ['14,400', '8,085', '6,315'],
'2006': ['14,750', '8,017', '6,733'], '2005': ['14,497', '7,593', '6,904'],
'2004': ['14,155', '7,150', '7,005'], '2003': ['13,285', '6,457', '6,828'],
'2002': ['12,821', '6,190', '6,631'], '2001': ['12,702', '6,080', '6,622'],
'2000': ['11,942', '5,985', '5,957'], '1999': ['10,872', '5,824', '5,048'],
'1998': ['10,362', '5,793', '4,569'], '1997': ['9,546', '5,479', '4,067'],
'1996': ['9,222', '5,418', '3,804'], '1995': ['8,859', '5,363', '3,496'],
'1994': ['8,203', '5,099', '3,104'], '1993': ['7,766', '4,861', '2,905'],
'1992': ['7,091', '4,520', '2,571'], '1991': ['6,953', '4,526', '2,427'],
'1990': ['6,632', '4,509', '2,123'], '1989': ['5,929', '4,011', '1,918'],
'1988': ['5,909', '4,080', '1,829']
</code></pre>
<p>}</p>
<pre><code># the dataframe
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 \
0 5,909 5,929 6,632 6,953 7,091 7,766 8,203 8,859 9,222 9,546
1 4,080 4,011 4,509 4,526 4,520 4,861 5,099 5,363 5,418 5,479
2 1,829 1,918 2,123 2,427 2,571 2,905 3,104 3,496 3,804 4,067
... 2009 2010 2011 2012 2013 2014 2015 2016 \
0 ... 14,632 14,505 14,149 14,010 14,259 14,415 15,071 15,944
1 ... 8,022 7,943 8,126 8,143 8,269 8,596 9,079 9,971
2 ... 6,610 6,562 6,023 5,867 5,990 5,819 5,992 5,973
2017 2018
0 16,478 7,107
1 10,286 4,394
2 6,192 2,713
[3 rows x 31 columns]
</code></pre>
<hr/>
<p>编辑后:</p>
<pre><code>import pandas as pd
data = ['2018\t \t7,107\t4,394\t2,713', '2017\t \t16,478\t10,286\t6,192',
'2016\t \t15,944\t9,971\t5,973', '2015\t \t15,071\t9,079\t5,992',
'2014\t \t14,415\t8,596\t5,819', '2013\t \t14,259\t8,269\t5,990',
'2012\t \t14,010\t8,143\t5,867', '2011\t \t14,149\t8,126\t6,023',
'2010\t \t14,505\t7,943\t6,562', '2009\t \t14,632\t8,022\t6,610',
'2008\t \t14,207\t7,989\t6,218', '2007\t \t14,400\t8,085\t6,315',
'2006\t \t14,750\t8,017\t6,733', '2005\t \t14,497\t7,593\t6,904',
'2004\t \t14,155\t7,150\t7,005', '2003\t \t13,285\t6,457\t6,828',
'2002\t \t12,821\t6,190\t6,631', '2001\t \t12,702\t6,080\t6,622',
'2000\t \t11,942\t5,985\t5,957', '1999\t \t10,872\t5,824\t5,048',
'2018\t \t10,362\t5,793\t4,569', '2017\t \t9,546\t5,479\t4,067',
'2016\t \t9,222\t5,418\t3,804', '2015\t \t8,859\t5,363\t3,496',
'2014\t \t8,203\t5,099\t3,104', '2013\t \t7,766\t4,861\t2,905',
'2012\t \t7,091\t4,520\t2,571', '2011\t \t6,953\t4,526\t2,427',
'2010\t \t6,632\t4,509\t2,123', '2009\t \t5,929\t4,011\t1,918',
'2008\t \t5,909\t4,080\t1,829']
# partition and clean the data
cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data ]
import pandas as pd
df = pd.DataFrame(cleaned,columns=['year', 'data1', 'data2', 'data3'])
print(df)
</code></pre>
<p>编辑后输出:</p>
<pre><code> year data1 data2 data3
0 2018 7,107 4,394 2,713
1 2017 16,478 10,286 6,192
2 2016 15,944 9,971 5,973
3 2015 15,071 9,079 5,992
4 2014 14,415 8,596 5,819
5 2013 14,259 8,269 5,990
6 2012 14,010 8,143 5,867
7 2011 14,149 8,126 6,023
8 2010 14,505 7,943 6,562
9 2009 14,632 8,022 6,610
10 2008 14,207 7,989 6,218
11 2007 14,400 8,085 6,315
12 2006 14,750 8,017 6,733
13 2005 14,497 7,593 6,904
14 2004 14,155 7,150 7,005
15 2003 13,285 6,457 6,828
16 2002 12,821 6,190 6,631
17 2001 12,702 6,080 6,622
18 2000 11,942 5,985 5,957
19 1999 10,872 5,824 5,048
20 2018 10,362 5,793 4,569
21 2017 9,546 5,479 4,067
22 2016 9,222 5,418 3,804
23 2015 8,859 5,363 3,496
24 2014 8,203 5,099 3,104
25 2013 7,766 4,861 2,905
26 2012 7,091 4,520 2,571
27 2011 6,953 4,526 2,427
28 2010 6,632 4,509 2,123
29 2009 5,929 4,011 1,918
30 2008 5,909 4,080 1,829
</code></pre>
<hr/>
<p>编辑:</p>
<pre><code>cleaned = [ [x.strip() for x in year.split("\t") if x.strip()] for year in data ]
</code></pre>
<p>与以下内容大致相同:</p>
<pre><code>alsoCleaned = []
for year in data:
part = [] # collect all parts of one string
for x in year.split("\t"): # split the one string
partCleaned = x.strip() # remove whitespaces from x
if partCleaned : # only if now got content
part.append(partCleaned) # add to part
alsoCleaned.append(part) # done all parts so add to big list
part = []
print(alsoCleaned)
</code></pre>
<p>==></p>
<pre><code>[['2018', '7,107', '4,394', '2,713'], ['2017', '16,478', '10,286', '6,192'],
# .... and so on ....,
['2008', '5,909', '4,080', '1,829']]
</code></pre>