sklearn mask for OneHotEncoder does not …

categorical_features: “all” or array of indices or mask : Specify what features are treated as categorical. ‘all’ (default): All features are treated as categorical. array of indices: Array of categorical feature indices. mask: Array of length n_features and with dtype=bool.

3 answers

User

#1 · edited 2024-05-18 07:33:19

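I think there is some confusion here. You still have to pass in numeric values, but in the encoder you can specify which features are categorical and which are not.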

The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.

So in the example below I changed aaa to 5 and bbb to 6, so that they stay distinct from the numeric values 1 and 2:

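import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed in later releases
d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True, False, False], dtype=bool))
ohe.fit(d)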

Now you can inspect the categories found for that feature:

ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
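On releases where `categorical_features` is no longer available, the same idea (encode only some columns, leave the rest alone) can be sketched with `ColumnTransformer`. This is not part of the answer above; it is a minimal sketch that assumes scikit-learn 0.20 or newer, and the transformer name `"onehot"` is an arbitrary label chosen for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same toy data as above, but keeping the original strings in column 0.
d = np.array([["aaa", 1, 1], ["bbb", 2, 2]], dtype=object)

# Encode only column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks for a dense result.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(d)

# Categories learned for the encoded column.
categories = ct.named_transformers_["onehot"].categories_
print(categories)
print(encoded)
```

The encoded columns come first in the output, followed by the passed-through columns, so no manual conversion of the strings to integers is needed in this variant.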

User

#2 · edited 2024-05-18 07:33:19

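You should know that all estimators in scikit-learn are designed for numeric input. From that point of view, it makes no sense to keep the text column in this form. You have to either convert that text column to a numeric representation or drop it.

If your dataset comes from a Pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It helps you transform all the required columns in one step (or leave some columns in numeric form as they are).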

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb

# SomeEncoder here must be any encoder which will help you to get
# numerical representation from text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
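The answer leaves `SomeEncoder` open. As one possible way to get a numeric representation of the text column, the sketch below uses `LabelEncoder` with plain pandas and no `sklearn_pandas`. Choosing `LabelEncoder` and the column name `text_code` are assumptions made for illustration, not what the answer prescribes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({"text": ["aaa", "bbb"], "number_1": [1, 1], "number_2": [2, 2]})

# LabelEncoder is only one possible stand-in for SomeEncoder (an assumption);
# it maps each distinct string to an integer code in sorted order.
encoder = LabelEncoder()
data["text_code"] = encoder.fit_transform(data["text"])

# Drop the raw text so that only numeric columns remain.
numeric = data.drop(columns="text")
print(list(encoder.classes_))
print(numeric)
```

Once every column is numeric, the frame can be handed to an encoder or estimator without triggering the conversion error discussed in this thread.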

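User

#3 · edited 2024-05-18 07:33:19

I ran into the same behavior and found it frustrating. As others have pointed out, scikit-learn requires all data to be numeric before it even considers selecting the columns given in the categorical_features parameter.

Specifically, column selection is handled by the _transform_selected() method in the scikit-learn source, and the very first line of that method is

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any of the data in the supplied DataFrame X cannot be converted to float.

I agree that the OneHotEncoder documentation is misleading in this respect.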
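The failure described above can be reproduced in isolation by calling `check_array` on mixed data. This is a minimal sketch: the tuple `float_dtypes` is a local stand-in assumed to mirror scikit-learn's `FLOAT_DTYPES`, and the variable names are illustrative.

```python
import numpy as np
from sklearn.utils import check_array

# Local stand-in for scikit-learn's FLOAT_DTYPES (assumed contents).
float_dtypes = (np.float64, np.float32, np.float16)

# Mixed data: a text column followed by two numeric columns.
X_mixed = np.array([["aaa", 1, 1], ["bbb", 2, 2]], dtype=object)

message = ""
try:
    check_array(X_mixed, accept_sparse="csc", copy=True, dtype=float_dtypes)
    conversion_failed = False
except ValueError as exc:
    # The string 'aaa' cannot be converted to float, so validation stops here,
    # before any column mask could be applied.
    conversion_failed = True
    message = str(exc)

# Purely numeric input passes the same check and comes back as floats.
X_numeric = check_array(
    np.array([[5, 1, 1], [6, 2, 2]]), accept_sparse="csc", copy=True, dtype=float_dtypes
)
print(conversion_failed, message)
print(X_numeric.dtype)
```

Because the check runs on the whole array, a mask that marks the text column as categorical never gets a chance to take effect.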
