Pandas: Creating binary columns from a string column based on predefined categories (dummy / one-hot encoding)

Item  Description       Red  Blue  Yellow  Pink  Shirt  Skirt
R2G1  RED, BLUE, SHIRT    1     1       0     0      1      0
G23A  YELLOW SHIRT        0     0       1     0      1      0
P001  BLUE, PINK SKIRT    0     1       0     1      0      1

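def get_category(series):
    # `series` is a single description string; collect every predefined
    # category from category_list that occurs in it.
    res = []
    for i in category_list:
        if i in series.upper():
            res.append(i)
    return res

df['Categories'] = df['Description'].apply(get_category)
df = df.join(df['Categories'].str.join('|').str.get_dummies())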

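3 Answers

User

#1 · Edited 2024-09-24 22:29:08

If you give the items a uniform separator, you can use the sep parameter of str.get_dummies to get the one-hot encoding. To do that, I replaced the spaces with commas: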

>>> df['Description'].str.replace(' ', ',').str.get_dummies(sep=',')
   BLUE  PINK  RED  SHIRT  SKIRT  YELLOW
0     1     0    1      1      0       0
1     0     0    0      1      0       1
2     1     1    0      0      1       0

Then you just need to join:

>>> df.join(df['Description'].str.replace(' ', ',').str.get_dummies(sep=','))
   Item       Description  BLUE  PINK  RED  SHIRT  SKIRT  YELLOW
0  R2G1  RED, BLUE, SHIRT     1     0    1      1      0       0
1  G23A      YELLOW SHIRT     0     0    0      1      0       1
2  P001  BLUE, PINK SKIRT     1     1    0      0      1       0

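Note, however (as Rob commented), that this derives the categories from the Description column rather than from the categories list itself. So if a description contains something that is not in categories, you get an extra column. For example, if one description contains "GREEN", you will get a whole GREEN column.

Likewise, categories that never occur in the descriptions will not be included as columns. So if, say, the first row were missing, there would be no RED column.

If that is a problem, I could work out a way to fix these behaviors, but I think it is simpler to use heretolearn's answer or another method that explicitly includes categories.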
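One possible way to address both behaviors is to reindex the dummy columns against the predefined list. This is only a minimal sketch, not necessarily what the answerer had in mind. The sample frame is reconstructed from the question, and the extra "GREEN" entry in categories is a hypothetical addition used to show a category that matches no row:

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Item": ["R2G1", "G23A", "P001"],
    "Description": ["RED, BLUE, SHIRT", "YELLOW SHIRT", "BLUE, PINK SKIRT"],
})

# Predefined categories. "GREEN" is a hypothetical extra that matches no row.
categories = ["RED", "BLUE", "YELLOW", "PINK", "SHIRT", "SKIRT", "GREEN"]

# Same one-hot step as above: unify the separator, then split on it.
dummies = (
    df["Description"]
    .str.replace(" ", ",", regex=False)
    .str.get_dummies(sep=",")
)

# Keep exactly the predefined categories, in list order. Columns that are not
# in the list are dropped; listed categories with no match become all zeros.
dummies = dummies.reindex(columns=categories, fill_value=0)

out = df.join(dummies)
print(out)
```

With this, the set and order of the columns depend only on categories, not on which words happen to appear in Description.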

User

#2 · Edited 2024-09-24 22:29:08

You can try the following:

import pandas as pd
import numpy as np

categories = ['RED', 'BLUE', 'YELLOW', 'PINK', 'SHIRT', 'SKIRT']

def categorize(df, categories):
    for category in categories:
        df[category] = np.where(df.Description.str.contains(category), 1, 0)
    return df 

df = categorize(df, categories)

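Output:

   Item       Description  RED  BLUE  YELLOW  PINK  SHIRT  SKIRT
0  R2G1  RED, BLUE, SHIRT    1     1       0     0      1      0
1  G23A      YELLOW SHIRT    0     0       1     0      1      0
2  P001  BLUE, PINK SKIRT    0     1       0     1      0      1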
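One thing to keep in mind is that str.contains does substring matching, so a category such as RED would also match a word like "REDUCED", and matching is case-sensitive by default. If that matters for your data, a variant along the following lines might help. It is a rough sketch: the function name categorize_whole_words, the lowercase description, and the "X999" row are assumptions added purely for illustration.

```python
import re
import pandas as pd

# First row comes from the question; the lowercase second description and the
# "X999" row are hypothetical, added to show the matching behavior.
df = pd.DataFrame({
    "Item": ["R2G1", "G23A", "X999"],
    "Description": ["RED, BLUE, SHIRT", "yellow shirt", "REDUCED SKIRT"],
})

categories = ["RED", "BLUE", "YELLOW", "PINK", "SHIRT", "SKIRT"]

def categorize_whole_words(df, categories):
    """Return a copy of df with one 0/1 column per category (whole-word match)."""
    out = df.copy()
    for category in categories:
        # \b anchors the match to word boundaries; re.escape guards against
        # regex metacharacters inside a category name.
        pattern = rf"\b{re.escape(category)}\b"
        out[category] = (
            out["Description"]
            .str.contains(pattern, case=False, regex=True, na=False)
            .astype(int)
        )
    return out

result = categorize_whole_words(df, categories)
print(result)
```

Here "REDUCED SKIRT" is flagged for SKIRT but not for RED, and "yellow shirt" still matches YELLOW and SHIRT. Working on a copy also leaves the original df untouched.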

User

#3 · Edited 2024-09-24 22:29:08

Another approach, using findall and MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# Extract every predefined category found in each description (list per row).
f = df['Description'].str.findall('|'.join(categories))
out = df.join(pd.DataFrame(mlb.fit_transform(f), columns=mlb.classes_, index=df.index))

A slightly slower but more concise version uses series.str.get_dummies after findall, once the matches have been joined:

out = df.join(df['Description'].str.findall('|'.join(categories))
                         .str.join('|').str.get_dummies())

print(out)

   Item       Description  BLUE  PINK  RED  SHIRT  SKIRT  YELLOW
0  R2G1  RED, BLUE, SHIRT     1     0    1      1      0       0
1  G23A      YELLOW SHIRT     0     0    0      1      0       1
2  P001  BLUE, PINK SKIRT     1     1    0      0      1       0
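In the output above the columns come out in sorted order, and a category that matches no row would not get a column at all. If you want the columns to follow the categories list exactly, MultiLabelBinarizer accepts a classes argument. A small sketch, assuming the sample frame from the question and the categories list from the earlier answer:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "Item": ["R2G1", "G23A", "P001"],
    "Description": ["RED, BLUE, SHIRT", "YELLOW SHIRT", "BLUE, PINK SKIRT"],
})

categories = ["RED", "BLUE", "YELLOW", "PINK", "SHIRT", "SKIRT"]

# Passing classes= pins both the set and the order of the output columns.
mlb = MultiLabelBinarizer(classes=categories)

# One list of matched categories per row.
found = df["Description"].str.findall("|".join(categories))

encoded = pd.DataFrame(
    mlb.fit_transform(found), columns=mlb.classes_, index=df.index
)
out = df.join(encoded)
print(out)
```

The columns then appear as RED, BLUE, YELLOW, PINK, SHIRT, SKIRT, matching the layout asked for in the question.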
