Pandas: adding a column filter messes up the data structure

Posted 2024-10-04 05:28:37


Consider the .dta file inside this zip file.

This is the first row:

>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False)
>>> df.iloc[0]
year                   2014
month                     1
minsamp                   8
hhid        000936071123039
hhid2                 91001
# [...]
>>> df.iloc[0]['wage4']
nan

I double-checked this in Stata and it looks correct. So far, so good. Now I define a list of columns to keep and redo the exercise.

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...     convert_categoricals = False,
...     columns=columns+columns2)
>>> df.iloc[0]
wbho                       1
age                       65
female                     0
wage4       1.7014118346e+38
ind_nber                 101
year                    2014
month                      1
minsamp                    8
hhid                     NaN
hhid2                    NaN
fnlwgt              560.1073
Name: 0, dtype: object

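After adding the list of columns to keep, pandas

  • no longer recognizes the missing value: wage4 is a huge number instead of NaN.
  • creates missing values for hhid and hhid2.

Why?

Note: loading the whole dataset first and then filtering with df[columns+columns2] does not show this problem.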


2 Answers

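I traced this error back to a bug in pandas. I have fixed it in https://github.com/jbuyl/pandas/tree/fix-column-dtype-mixing and opened a pull request to get the fix merged, but feel free to check out my fork/branch in the meantime.

Here is the result of running your example: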

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...     convert_categoricals = False,
...     columns=columns+columns2)
>>> df.iloc[0]
wbho                      1
age                      65
female                    0
wage4                   nan
ind_nber                NaN
year                   2014
month                     1
minsamp                   8
hhid        000936071123039
hhid2                 91001
fnlwgt              560.107
Name: 0, dtype: object

This looks like a bug. In the source file pandas/io/stata.py, in the _do_select_columns() method, the loop:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])

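scrambles the order of the dtypes: they are collected in the order of data.columns, which no longer matches the order of the requested columns.

Compare the dtypes of df2 and df3 in this example: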
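To make the mismatch concrete, here is a minimal sketch that uses plain Python lists instead of the real reader. The column names are borrowed from the question, but the file order and the dtype strings are made up for illustration, and both helper functions are hypothetical. They are not pandas code.

```python
# Toy model of the problem: per-column metadata is collected in file order,
# while the caller expects it in the order that was requested.
file_columns = ["year", "hhid", "wage4", "age"]      # order in the file (made up)
file_dtypes = ["int16", "str", "float32", "int8"]    # one dtype per file column (made up)
requested = ["age", "wage4", "year", "hhid"]         # order the caller asked for


def select_in_file_order(file_columns, file_dtypes, requested):
    """Mimic the loop above: walk the file's columns and keep the requested ones."""
    wanted = set(requested)
    return [file_dtypes[i] for i, col in enumerate(file_columns) if col in wanted]


def select_in_requested_order(file_columns, file_dtypes, requested):
    """Look up each requested column's position so the metadata follows the request."""
    position = {col: i for i, col in enumerate(file_columns)}
    return [file_dtypes[position[col]] for col in requested]


# Pair the collected dtypes with the requested column names, as a caller would.
buggy = dict(zip(requested, select_in_file_order(file_columns, file_dtypes, requested)))
fixed = dict(zip(requested, select_in_requested_order(file_columns, file_dtypes, requested)))

print(buggy)  # wage4 ends up paired with the string dtype that belongs to hhid
print(fixed)  # every column is paired with its own dtype
```

In the toy data, the numeric column wage4 gets the string dtype and the string column hhid gets a numeric one, which is the same kind of mix-up the question reports.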

In [1]:

import zipfile
import pandas as pd
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df = pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals = False)
In [2]:

columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:

df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
                    convert_categoricals = False,
                    columns=columns+columns2)
In [4]:

df2.dtypes
Out[4]:
wbho          int16
age            int8
female         int8
wage4        object
ind_nber     object
year        float32
month          int8
minsamp        int8
hhid        float64
hhid2       float64
fnlwgt      float32
dtype: object
In [5]:

df3 = df[columns+columns2]
In [6]:

df3.dtypes
Out[6]:
wbho           int8
age            int8
female         int8
wage4       float32
ind_nber    float64
year          int16
month          int8
minsamp        int8
hhid         object
hhid2        object
fnlwgt      float32
dtype: object
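The two listings are easier to compare programmatically. The sketch below copies the dtypes from Out[4] and Out[6] into plain dictionaries (so it runs without the data file) and reports the columns whose dtypes differ. The helper name differing_columns is made up for this example.

```python
# dtypes copied from Out[4] (columns= passed to read_stata) and Out[6] (filtered after loading)
df2_dtypes = {
    "wbho": "int16", "age": "int8", "female": "int8", "wage4": "object",
    "ind_nber": "object", "year": "float32", "month": "int8", "minsamp": "int8",
    "hhid": "float64", "hhid2": "float64", "fnlwgt": "float32",
}
df3_dtypes = {
    "wbho": "int8", "age": "int8", "female": "int8", "wage4": "float32",
    "ind_nber": "float64", "year": "int16", "month": "int8", "minsamp": "int8",
    "hhid": "object", "hhid2": "object", "fnlwgt": "float32",
}


def differing_columns(left, right):
    """Return the columns (in the order of `left`) whose dtypes differ."""
    return [col for col in left if left[col] != right[col]]


mismatched = differing_columns(df2_dtypes, df3_dtypes)
for col in mismatched:
    print(f"{col}: {df2_dtypes[col]} vs {df3_dtypes[col]}")
```

With the real frames in memory, the same check could be written as df2.dtypes[df2.dtypes != df3.dtypes], since both Series share the same column index.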

Changing it to:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
#    if col in column_set:
#        matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])

fixes the problem.

(I am not sure what matched is for here. It does not seem to be used afterwards.)
