Handling too many categorical features with scikit-learn

Posted 2024-10-04 15:26:29

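I'm new to scikit-learn and I'm trying to use it to make predictions on income data. This may be a duplicate question, since I've seen another post on the topic, but I'm looking for a simple example to understand what scikit-learn estimators expect as input.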

The data I have is structured as follows; many of the features are categorical (e.g. workclass, education, ...):

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

Sample records:

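38   Private    215646   HS-grad    9    Divorced    Handlers-cleaners   Not-in-family   White   Male   0   0   40   United-States   <=50K
53   Private    234721   11th   7    Married-civ-spouse  Handlers-cleaners   Husband     Black   Male   0   0   40   United-States   <=50K
30   State-gov  141297   Bachelors  13   Married-civ-spouse  Prof-specialty  Husband     Asian-Pac-Islander  Male   0   0   40   India   >50K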

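I'm having a hard time handling the categorical features, because most models in scikit-learn expect all features to be numeric. The library does provide classes to transform/encode such features (such as OneHotEncoder and DictVectorizer), but I can't work out how to apply them to my data. I know there are quite a few steps before everything is fully encoded as numbers, but I'd like to know whether there is a simpler, more efficient way (since there are so many of these features), ideally shown with an example. I vaguely understand that DictVectorizer is the way to go, but I need help with how to proceed.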


1 answer
User
#1 · Posted 2024-10-04 15:26:29

Here is some example code using DictVectorizer. First, let's set up some data in the Python shell; I'll leave reading the file to you.

>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
...             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"]
>>> input_text = """38   Private    215646   HS-grad    9    Divorced    Handlers-cleaners   Not-in-family   White   Male   0   0   40   United-States   <=50K
... 53   Private    234721   11th   7    Married-civ-spouse  Handlers-cleaners   Husband     Black   Male   0   0   40   United-States   <=50K
... 30   State-gov  141297   Bachelors  13   Married-civ-spouse  Prof-specialty  Husband     Asian-Pac-Islander  Male   0   0   40   India   >50K
... """

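Now, parse it:

>>> samples = []
>>> y = []
>>> for line in input_text.splitlines():
...     fields = line.split()
...     # every field except the last is a feature; the last one is the label
...     samples.append(dict(zip(features, fields[:-1])))
...     y.append(fields[-1])
...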

What do we have now? Let's check:

>>> from pprint import pprint
>>> pprint(samples[0])
{'age': '38',
 'capital-gain': '0',
 'capital-loss': '0',
 'education': 'HS-grad',
 'education-num': '9',
 'fnlwgt': '215646',
 'hours-per-week': '40',
 'marital-status': 'Divorced',
 'native-country': 'United-States',
 'occupation': 'Handlers-cleaners',
 'race': 'White',
 'relationship': 'Not-in-family',
 'sex': 'Male',
 'workclass': 'Private'}
>>> print(y)
['<=50K', '<=50K', '>50K']
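
One thing the output above makes visible: every value, including age and hours-per-week, is still a string, so DictVectorizer will treat those columns as categories too. If that is not what you want, a minimal sketch of casting the columns the question marks as continuous to floats first might look like this; the names `CONTINUOUS` and `to_typed` are illustrative and not part of the original answer.

```python
# Columns the question describes as "continuous".
CONTINUOUS = {"age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"}

def to_typed(sample):
    """Return a copy of a sample dict with continuous columns cast to float."""
    return {name: float(value) if name in CONTINUOUS else value
            for name, value in sample.items()}

# A shortened record in the same shape as samples[0].
sample = {"age": "38", "workclass": "Private", "hours-per-week": "40"}
typed = to_typed(sample)
print(typed)
```

DictVectorizer keeps numeric values as a single column and only expands string values into indicator columns, so typed samples produce far fewer columns.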

These samples are now in the format DictVectorizer expects, so pass them in:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> X = dv.fit_transform(samples)
>>> X
<3x29 sparse matrix of type '<type 'numpy.float64'>'
        with 42 stored elements in Compressed Sparse Row format>

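And there you go: you have X and y, which can be passed to an estimator, provided it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)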
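
As a rough illustration of that last step, the vectorizer and an estimator can be chained in a pipeline so that the list of dicts is passed in directly. The choice of LogisticRegression is an assumption made for this sketch (the answer does not name an estimator), and the records are shortened stand-ins for `samples` and `y`.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Shortened, illustrative records and labels in the same shape as samples / y.
samples = [
    {"education": "HS-grad", "occupation": "Handlers-cleaners"},
    {"education": "11th", "occupation": "Handlers-cleaners"},
    {"education": "Bachelors", "occupation": "Prof-specialty"},
]
y = ["<=50K", "<=50K", ">50K"]

# The pipeline vectorizes the dicts, then fits the classifier on the sparse matrix.
model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(samples, y)

predictions = model.predict(
    [{"education": "Bachelors", "occupation": "Prof-specialty"}]
)
print(predictions)
```

Keeping both steps in one pipeline means the same fitted vectorizer is reused at prediction time, so training and test rows always share the same columns.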

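Similarly, test samples can be passed to DictVectorizer.transform; any feature/value combinations in the test set that did not occur in the training set are simply ignored (since the learned model can't do anything meaningful with them anyway).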
