Converting normalized values into categories for classification

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
0           1        337          118                  4  4.5  4.5  9.65         1             0.92
1           2        324          107                  4  4.0  4.5  8.87         1             0.76

2 answers

User

#1 · edited 2024-06-01 07:29:57

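Since these are "normalized" values, why do you need a classifier to categorize them? A simple threshold should work fine,

i.e. 0-0.33 low, 0.33-0.66 medium, 0.66-1.0 high.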
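The thresholds above can be sketched in plain Python. This is only a minimal illustration: the function name to_category is made up for the example, and which side each boundary value (0.33, 0.66) falls on is an assumption, since the ranges above do not say.

```python
def to_category(value):
    """Map a normalized value in [0, 1] to a category using fixed thresholds.

    Assumption: boundary values go to the upper category
    (0.33 -> 'medium', 0.66 -> 'high').
    """
    if value < 0.33:
        return "low"
    if value < 0.66:
        return "medium"
    return "high"


# Sample values taken from the "Chance of Admit" column used in this thread.
chances = [0.92, 0.76, 0.31, 0.45]
categories = [to_category(v) for v in chances]
print(categories)  # ['high', 'high', 'low', 'medium']
```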

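Probably the only reason you would want an automated method is if your number of categories keeps changing.

To do the binning you can use pandas, but you need to decide on the range and number of bins (categories). Going by the documentation, I think this should work: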

In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [9]: df.head(10)
Out[9]: 
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9

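You can then replace df.value with your "Chance of Admit" column (writing the result to a column such as df['group']) and fill in the ranges for your discrete bins, either from fixed thresholds or automatically based on the number of bins you want.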
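As a rough sketch of the "automatic" option: pd.cut accepts an integer number of bins and derives equal-width edges from the data, while pd.qcut splits on quantiles so each bin holds about the same number of rows. The sample values, the bin count of 3 and the column names equal_width and equal_freq are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample values standing in for the "Chance of Admit" column.
df = pd.DataFrame({"Chance of Admit": [0.92, 0.76, 0.31, 0.45, 0.10, 0.58]})

n_bins = 3  # assumed number of categories
names = ["low", "medium", "high"]

# Equal-width bins: the observed min..max range is split into n_bins intervals.
df["equal_width"] = pd.cut(df["Chance of Admit"], bins=n_bins, labels=names)

# Equal-frequency bins: edges are quantiles, so bins hold roughly equal row counts.
df["equal_freq"] = pd.qcut(df["Chance of Admit"], q=n_bins, labels=names)

print(df)
```

Note that with an integer bins argument the edges depend on the data you pass in, so the same value can land in a different category on a different dataset. Fixed thresholds avoid that.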

For reference:

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

User

#2 · edited 2024-06-01 07:29:57

IIUC, you want to map a continuous variable to categorical values based on ranges, for example:

0.96 -> high, 
0.31 -> low
...

pandas provides a function for exactly this, cut. From the documentation:

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable.

Setup

   Serial No.  GRE Score  TOEFL Score  ...  CGPA  Research  Chance of Admit
0           1        337          118  ...  9.65         1             0.92
1           2        324          107  ...  8.87         1             0.76
2           2        324          107  ...  8.87         1             0.31
3           2        324          107  ...  8.87         1             0.45

[4 rows x 9 columns]

Assuming the setup above, you can use cut like this:

labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(labels)

Output

0      high
1      high
2       low
3    medium
Name: Chance of Admit, dtype: category
Categories (3, object): [low < medium < high]

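Note that we used 3 bins: [(0, 0.33], (0.33, 0.66], (0.66, 1.0]], and that the values of the Chance of Admit column are [0.92, 0.76, 0.31, 0.45]. To change the label names, just change the value of the labels parameter, for example: labels=['unlikely', 'doable', 'likely']. If you need ordinal values instead, do: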

labels = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=list(range(3)))
print(labels)

Output

0    2
1    2
2    0
3    1
Name: Chance of Admit, dtype: category
Categories (3, int64): [0 < 1 < 2]
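If plain integers are preferred over a categorical column, two other routes should give the same codes: passing labels=False to cut, or reading .cat.codes from a labeled result. A small sketch, using a standalone Series built from the values above rather than the full DataFrame:

```python
import pandas as pd

# Values of the "Chance of Admit" column from the setup above.
chance = pd.Series([0.92, 0.76, 0.31, 0.45], name="Chance of Admit")
bins = [0, 0.33, 0.66, 1.0]

# labels=False returns the index of the bin each value falls into.
codes = pd.cut(chance, bins, labels=False)

# Alternatively, keep the named categories and read their integer codes.
named = pd.cut(chance, bins, labels=["low", "medium", "high"])
codes_from_names = named.cat.codes

print(codes.tolist())             # [2, 2, 0, 1]
print(codes_from_names.tolist())  # [2, 2, 0, 1]
```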

Finally, to put everything in perspective, you can add the result to your DataFrame like this:

df['group'] = pd.cut(df['Chance of Admit'], [0, 0.33, 0.66, 1.0], labels=['low', 'medium', 'high'])
print(df)

Output

   Serial No.  GRE Score  TOEFL Score  ...  Research  Chance of Admit   group
0           1        337          118  ...         1             0.92    high
1           2        324          107  ...         1             0.76    high
2           2        324          107  ...         1             0.31     low
3           2        324          107  ...         1             0.45  medium

[4 rows x 10 columns]
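One edge case worth checking: the bins used here are open on the left, so a value of exactly 0 falls outside (0, 0.33] and comes back as NaN. Passing include_lowest=True makes the first bin include its left edge. The Series below is made-up boundary data for illustration, not part of the question's dataset.

```python
import pandas as pd

# Hypothetical boundary values, chosen to sit exactly on the bin edges.
chance = pd.Series([0.0, 0.33, 0.66, 1.0])
bins = [0, 0.33, 0.66, 1.0]
names = ["low", "medium", "high"]

# Default behaviour: bins are (0, 0.33], (0.33, 0.66], (0.66, 1.0], so 0.0 is unbinned.
default = pd.cut(chance, bins, labels=names)

# include_lowest=True closes the first bin on the left, so 0.0 becomes 'low'.
inclusive = pd.cut(chance, bins, labels=names, include_lowest=True)

print(default.isna().tolist())  # [True, False, False, False]
print(inclusive.tolist())       # ['low', 'low', 'medium', 'high']
```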
