<p>This looks like a bug. In the source of <code>pandas/io/stata.py</code>, in the <code>_do_select_columns()</code> method, the loop:</p>
<pre><code>dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])
</code></pre>
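<p>messes up the order of the <code>dtypes</code>: the loop walks <code>data.columns</code> in file order, so the collected lists no longer match the order of the requested columns (the names held in <code>column_set</code>).</p>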
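<p>A minimal sketch of the mismatch in plain Python, assuming the selected data comes back in the requested order while the metadata is collected in file order. The column names and dtypes below are made up for illustration and are not pandas internals.</p>

```python
# Hypothetical stand-ins for the reader's state (illustration only).
file_columns = ["year", "month", "hhid", "wbho", "age"]    # order in the .dta file
file_dtypes = ["int16", "int8", "object", "int8", "int8"]  # one dtype per file column

requested = ["wbho", "age", "year", "month", "hhid"]       # order passed as columns=
column_set = set(requested)

# Same shape as the loop above: it walks the file order, not the requested order.
collected = [
    file_dtypes[i]
    for i, col in enumerate(file_columns)
    if col in column_set
]

# Pairing the collected dtypes with the requested names by position mislabels them.
paired = dict(zip(requested, collected))

# What each requested column should have been paired with.
expected = {col: file_dtypes[file_columns.index(col)] for col in requested}

print(paired)
print(expected)
```

<p>Here <code>wbho</code> ends up paired with the dtype of <code>year</code>, which is the same kind of shift visible in the session output.</p>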
<p>Compare the <code>dtypes</code> of <code>df2</code> and <code>df3</code> in this example:</p>
<pre><code>In [1]:
import zipfile
import pandas as pd
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df = pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals=False)
In [2]:
columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:
df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
                    convert_categoricals=False,
                    columns=columns+columns2)
In [4]:
df2.dtypes
Out[4]:
wbho int16
age int8
female int8
wage4 object
ind_nber object
year float32
month int8
minsamp int8
hhid float64
hhid2 float64
fnlwgt float32
dtype: object
In [5]:
df3 = df[columns+columns2]
In [6]:
df3.dtypes
Out[6]:
wbho int8
age int8
female int8
wage4 float32
ind_nber float64
year int16
month int8
minsamp int8
hhid object
hhid2 object
fnlwgt float32
dtype: object
</code></pre>
<p>Changing it to:</p>
<pre><code>dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
# Visit the positions of the requested columns in the requested order.
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
    # if col in column_set:
    #     matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])
</code></pre>
<p>fixes the problem.</p>
<p>(I'm not sure what <code>matched</code> is doing here. It doesn't seem to be used anywhere afterwards.)</p>
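<p>The idea behind the change, looking up each requested column's position in the file and visiting those positions in the requested order, can be sketched without NumPy. The helper name <code>positions_in_requested_order</code> and the sample data are hypothetical. This is an illustration of the approach, not the code pandas uses.</p>

```python
def positions_in_requested_order(file_columns, requested):
    """Return the file position of each requested column, in requested order."""
    position = {name: i for i, name in enumerate(file_columns)}
    return [position[name] for name in requested]


# Hypothetical sample data (illustration only).
file_columns = ["year", "month", "hhid", "wbho", "age"]    # order in the file
file_dtypes = ["int16", "int8", "object", "int8", "int8"]  # one dtype per file column
requested = ["wbho", "age", "year", "month", "hhid"]       # order passed as columns=

idx = positions_in_requested_order(file_columns, requested)

# Metadata gathered through idx lines up with the requested column order.
dtypes_in_requested_order = [file_dtypes[i] for i in idx]

print(idx)
print(dict(zip(requested, dtypes_in_requested_order)))
```

<p>Unlike the <code>np.argwhere</code> version, this sketch assumes column names are unique and raises <code>KeyError</code> for a requested name that is not in the file.</p>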