Renaming less frequent categories to "OTHER" in Python

Posted 2024-10-01 02:18:25


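In my DataFrame I have a categorical column with more than 100 distinct categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename all the less frequent ones to OTHER.

Example:

My DataFrame: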

print(df)

    Employee_number                 Jobrol
0                 1        Sales Executive
1                 2     Research Scientist
2                 3  Laboratory Technician
3                 4        Sales Executive
4                 5     Research Scientist
5                 6  Laboratory Technician
6                 7        Sales Executive
7                 8     Research Scientist
8                 9  Laboratory Technician
9                10        Sales Executive
10               11     Research Scientist
11               12  Laboratory Technician
12               13        Sales Executive
13               14     Research Scientist
14               15  Laboratory Technician
15               16        Sales Executive
16               17     Research Scientist
17               18     Research Scientist
18               19                Manager
19               20        Human Resources
20               21        Sales Executive


valCount = df['Jobrol'].value_counts()

valCount

Sales Executive          7
Research Scientist       7
Laboratory Technician    5
Manager                  1
Human Resources          1

Here I want to keep the top 3 categories and rename the rest to "OTHER". How should I proceed?

Thanks.


2 Answers

Use value_counts together with numpy.where:

import numpy as np
need = df['Jobrol'].value_counts().index[:3]  # the 3 most frequent categories
df['Jobrol'] = np.where(df['Jobrol'].isin(need), df['Jobrol'], 'OTHER')

valCount = df['Jobrol'].value_counts()
print (valCount)
Research Scientist       7
Sales Executive          7
Laboratory Technician    5
OTHER                    2
Name: Jobrol, dtype: int64
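
The same idea can be wrapped in a small reusable function. This is only a sketch: the helper name keep_top_n and its parameters are hypothetical and not part of the answer above, and the sample data is rebuilt from the counts shown in the question.

```python
import numpy as np
import pandas as pd


def keep_top_n(series, n, other="OTHER"):
    """Keep the n most frequent values of `series`; relabel the rest as `other`.

    Hypothetical helper, written for illustration only.
    """
    top = series.value_counts().index[:n]
    relabelled = np.where(series.isin(top), series, other)
    # np.where returns a plain array, so restore the index and name.
    return pd.Series(relabelled, index=series.index, name=series.name)


# Sample data with the same counts as the question (7, 7, 5, 1, 1).
jobs = pd.Series(
    ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"],
    name="Jobrol",
)

result = keep_top_n(jobs, 3)
counts = result.value_counts().to_dict()
```

If several categories tie exactly at the cutoff, which of them end up in the top n depends on the order returned by value_counts.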

Another solution:

[code snippet not preserved in the source]
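
One way an alternative could look, offered as an assumption rather than as the answerer's own code, is Series.where, which keeps values where the condition holds and substitutes a replacement elsewhere. The sample DataFrame below is rebuilt from the counts in the question.

```python
import pandas as pd

# Sample data with the same counts as the question (7, 7, 5, 1, 1).
df = pd.DataFrame({
    "Jobrol": ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"]
})

# The three most frequent categories.
top = df["Jobrol"].value_counts().index[:3]

# Keep rows whose label is in `top`; everything else becomes 'OTHER'.
df["Jobrol"] = df["Jobrol"].where(df["Jobrol"].isin(top), "OTHER")

n_other = int((df["Jobrol"] == "OTHER").sum())
labels = sorted(df["Jobrol"].unique())
```

Unlike np.where, this stays within pandas and returns a Series with the original index, so the result can be assigned back directly.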

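Convert the series to categorical, extract the categories whose counts are not in the top 3, add a new category such as 'Other', and then replace the previously extracted categories with it: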

df['Jobrol'] = df['Jobrol'].astype('category')

others = df['Jobrol'].value_counts().index[3:]
label = 'Other'

df['Jobrol'] = df['Jobrol'].cat.add_categories([label])
df['Jobrol'] = df['Jobrol'].replace(others, label)

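Note: it is tempting to combine the categories by renaming them with df['Jobrol'].cat.rename_categories(dict.fromkeys(others, label)), but this does not work, because it would produce several categories with the same label, which is not allowed.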
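
The restriction on duplicate labels can be checked with a tiny example. This is a minimal sketch on made-up data (the values "a", "b", "c" are not from the question); it only shows that mapping two categories to the same label raises a ValueError.

```python
import pandas as pd

# Toy categorical series; the values are illustrative only.
s = pd.Series(["a", "b", "c"], dtype="category")

try:
    # Mapping both 'b' and 'c' to 'Other' would create duplicate categories.
    s.cat.rename_categories({"b": "Other", "c": "Other"})
    raised = False
except ValueError:
    raised = True
```

This is why the approach above adds the new category first and then replaces values, instead of renaming categories.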


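The solution above can be adapted to filter by count instead of rank. For example, to include only the categories with a count of 1, others can be defined as:

[code snippet not preserved in the source]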
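
A possible way to select categories by count, given here as an assumption and not as the answerer's original snippet, is to filter the value_counts result with a boolean condition. For brevity the sketch works on a plain string column rather than a categorical one, and uses Series.mask to do the replacement.

```python
import pandas as pd

# Sample data with the same counts as the question (7, 7, 5, 1, 1).
df = pd.DataFrame({
    "Jobrol": ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"]
})

counts = df["Jobrol"].value_counts()

# Categories that occur exactly once.
others = counts[counts == 1].index

# Replace every row whose label is in `others` with 'Other'.
df["Jobrol"] = df["Jobrol"].mask(df["Jobrol"].isin(others), "Other")

new_counts = df["Jobrol"].value_counts().to_dict()
```

Changing the condition, for example to counts < 5, would lump together every category below that threshold.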
