One-hot encoding a column, including categories that do not appear in the column

Posted 2024-07-07 09:03:51


My dataframe:

Index letters
0     A
1     B
2     D
3     Z

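In Python, I want to get a one-hot encoded dataframe of the letters column above that also includes columns for elements not present in the column, like this: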

Index A B C D E K Z
0     1 0 0 0 0 0 0
1     0 1 0 0 0 0 0
2     0 0 0 1 0 0 0
3     0 0 0 0 0 0 1

3 answers

Use merge

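import pandas as pd

df = pd.DataFrame({'Letters':['A','B', 'D', 'Z']})
all_letters = ['A','B', 'C', 'D','E','K', 'Z']
# One-hot encode the full set of letters; dtype=int gives 0/1 instead of booleans on pandas >= 2.0
s = pd.get_dummies(all_letters, dtype=int)
s['Letters'] = all_letters
# Inner merge keeps only the rows whose letter occurs in df
df2 = df.merge(s, on='Letters')
df2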

Output:

|    | Letters   |   A |   B |   C |   D |   E |   K |   Z |
|---:|:----------|----:|----:|----:|----:|----:|----:|----:|
|  0 | A         |   1 |   0 |   0 |   0 |   0 |   0 |   0 |
|  1 | B         |   0 |   1 |   0 |   0 |   0 |   0 |   0 |
|  2 | D         |   0 |   0 |   0 |   1 |   0 |   0 |   0 |
|  3 | Z         |   0 |   0 |   0 |   0 |   0 |   0 |   1 |

Use get_dummies for this

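# Note: this creates columns only for letters that actually occur in the data
# dtype=int gives 0/1 instead of booleans on pandas >= 2.0
df = pd.get_dummies(df, dtype=int)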
df.columns = df.columns.str.replace('letters_', '')
print(df)


   Index  A  B  D  Z
0      0  1  0  0  0
1      1  0  1  0  0
2      2  0  0  1  0
3      3  0  0  0  1

import pandas as pd

df = pd.DataFrame(["A", "A", "C", "C", "E", "F", "G"], columns=['letters'])

all_cats = ["A", "B", "C", "D", "E", "F", "G"]
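# reindex adds a zero-filled column for every category missing from the data
# dtype=int gives 0/1 instead of booleans on pandas >= 2.0
ohe = pd.get_dummies(df['letters'], sparse=True, dtype=int).reindex(all_cats, axis=1, fill_value=0)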

>>> ohe
   A  B  C  D  E  F  G
0  1  0  0  0  0  0  0
1  1  0  0  0  0  0  0
2  0  0  1  0  0  0  0
3  0  0  1  0  0  0  0
4  0  0  0  0  1  0  0
5  0  0  0  0  0  1  0
6  0  0  0  0  0  0  1
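
Another option, not taken from the answers above, is to declare the full category set on the column itself before encoding. The following is a minimal sketch, assuming the question's data and that the complete list of letters is known in advance; the names `all_cats` and `ohe` are only illustrative. Once a column has a categorical dtype, get_dummies should emit one column per declared category, including those that never occur.

```python
import pandas as pd

# Data from the question; all_cats is the assumed full list of categories.
df = pd.DataFrame({"letters": ["A", "B", "D", "Z"]})
all_cats = ["A", "B", "C", "D", "E", "K", "Z"]

# Convert the column to a categorical dtype that knows every category,
# including the ones (C, E, K) that do not appear in the data.
cat_dtype = pd.CategoricalDtype(categories=all_cats)
letters = df["letters"].astype(cat_dtype)

# get_dummies creates one column per declared category;
# dtype=int yields 0/1 rather than booleans on pandas >= 2.0.
ohe = pd.get_dummies(letters, dtype=int)
print(ohe)
```

Compared with reindex, this keeps the category list attached to the data, so the encoded columns stay consistent if the same dtype is reused on other dataframes. A value outside `all_cats` becomes NaN when converted and is encoded as a row of zeros.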
