Pandas: adding columns to the filter messes up the data structure

>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False)
>>> df.iloc[0]
year                  2014
month                    1
minsamp                  8
hhid       000936071123039
hhid2                91001
# [...]
>>> df.iloc[0]['wage4']
nan

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False, columns=columns+columns2)
>>> df.iloc[0]
wbho                       1
age                       65
female                     0
wage4       1.7014118346e+38
ind_nber                 101
year                    2014
month                      1
minsamp                    8
hhid                     NaN
hhid2                    NaN
fnlwgt              560.1073
Name: 0, dtype: object

2 answers

User

Answer 1 · edited 2024-10-04 05:28:37

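I traced this error to a bug in pandas. I have fixed it in https://github.com/jbuyl/pandas/tree/fix-column-dtype-mixing and opened a pull request to merge the fix, but feel free to check out my fork/branch in the meantime.

Here is the result of running your example with the fix applied: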

>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
...     convert_categoricals = False,
...     columns=columns+columns2)
>>> df.iloc[0]
wbho                      1
age                      65
female                    0
wage4                   nan
ind_nber                NaN
year                   2014
month                     1
minsamp                   8
hhid        000936071123039
hhid2                 91001
fnlwgt              560.107
Name: 0, dtype: object

User

Answer 2 · edited 2024-10-04 05:28:37

This appears to be a bug. In the source of pandas/io/stata.py, in the _do_select_columns() method, the loop:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])

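messes up the order of the dtypes: they are collected in the order the columns appear in the file (data.columns), which no longer matches the order in which the columns were requested.

Compare the dtypes of df2 and df3 in this example: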
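To make the mismatch concrete, here is a minimal sketch in plain Python that does not touch pandas internals. The column names are taken from the question, but the type labels and the `select_in_file_order` helper are made up for illustration only.

```python
# Hypothetical stand-ins: column order in the file and one type label per column.
file_columns = ['year', 'hhid', 'wbho', 'wage4']
file_types = ['int16', 'str', 'int8', 'float32']

# Order requested by the caller (differs from the file order).
requested = ['wbho', 'wage4', 'year', 'hhid']


def select_in_file_order(file_columns, file_types, requested):
    """Mimic the loop above: walk the file's columns and keep the wanted ones."""
    wanted = set(requested)
    return [typ for col, typ in zip(file_columns, file_types) if col in wanted]


types = select_in_file_order(file_columns, file_types, requested)

# Pairing the collected types with the requested order mislabels the columns.
mislabelled = dict(zip(requested, types))
print(mislabelled)
```

Because the types come back in file order, each requested column ends up paired with the type of a different column whenever the two orders differ.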

In [1]:

import zipfile
import pandas as pd
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df = pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals = False)
In [2]:

columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:

df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
                    convert_categoricals = False,
                    columns=columns+columns2)
In [4]:

df2.dtypes
Out[4]:
wbho          int16
age            int8
female         int8
wage4        object
ind_nber     object
year        float32
month          int8
minsamp        int8
hhid        float64
hhid2       float64
fnlwgt      float32
dtype: object
In [5]:

df3 = df[columns+columns2]
In [6]:

df3.dtypes
Out[6]:
wbho           int8
age            int8
female         int8
wage4       float32
ind_nber    float64
year          int16
month          int8
minsamp        int8
hhid         object
hhid2        object
fnlwgt      float32
dtype: object
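Rather than comparing the two listings by eye, the differences can be tabulated. The following is only a sketch: `dtype_diff` is a hypothetical helper, and the two tiny frames are made-up stand-ins for df2 and df3, not the real data.

```python
import pandas as pd


def dtype_diff(left, right):
    """Return a table of the columns whose dtypes differ between two frames."""
    table = pd.DataFrame({'left': left.dtypes.astype(str),
                          'right': right.dtypes.astype(str)})
    return table[table['left'] != table['right']]


# Tiny made-up frames standing in for df2 and df3.
a = pd.DataFrame({'wbho': pd.Series([1], dtype='int16'),
                  'age': pd.Series([65], dtype='int8')})
b = pd.DataFrame({'wbho': pd.Series([1], dtype='int8'),
                  'age': pd.Series([65], dtype='int8')})

diff = dtype_diff(a, b)
print(diff)
```

With the real frames, `dtype_diff(df2, df3)` would list only the columns whose types were mixed up.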

Changing it to:

dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
#    if col in column_set:
#        matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])

fixes the problem.

(I am not sure what matched is for here. It does not seem to be used afterwards.)
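The reordering idea behind the change above can also be sketched without NumPy. This is an illustration under assumptions, not the actual pandas fix: `select_in_requested_order` is a hypothetical helper, the type labels are invented, and the missing-column check is an extra added here.

```python
def select_in_requested_order(file_columns, file_types, requested):
    """Return the type of each requested column, in the requested order."""
    # Map each column name to its position in the file.
    position = {col: i for i, col in enumerate(file_columns)}
    missing = [col for col in requested if col not in position]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    # Look the types up by position, following the caller's order.
    return [file_types[position[col]] for col in requested]


# Hypothetical stand-ins for the file's columns and their types.
file_columns = ['year', 'hhid', 'wbho', 'wage4']
file_types = ['int16', 'str', 'int8', 'float32']
requested = ['wbho', 'wage4', 'year', 'hhid']

ordered = select_in_requested_order(file_columns, file_types, requested)
print(dict(zip(requested, ordered)))
```

Iterating over the requested names, rather than over the file's columns, keeps every type attached to the column it belongs to.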
