How to get the most informative features for scikit-learn classifiers?

viagra = None      ok : spam  =  4.5 : 1.0
hello = True       ok : spam  =  4.5 : 1.0
hello = None     spam : ok    =  3.3 : 1.0
viagra = True    spam : ok    =  3.3 : 1.0
casino = True    spam : ok    =  2.0 : 1.0
casino = None      ok : spam  =  1.5 : 1.0

3 Answers

User

Answer 1 · edited 2024-09-22 16:39:05

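As an update: RandomForestClassifier now supports the .feature_importances_ attribute. It tells you how much each feature contributes to the fitted model. scikit-learn computes it as the normalized total reduction of the split criterion (impurity) brought by that feature, so it is not a share of explained variance in the strict sense. Because the values are normalized, they sum to 1.

I find this attribute very useful when doing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

Edit: this works for both RandomForest and GradientBoosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support it.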
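A minimal sketch of reading the attribute and ranking features by it. The iris dataset, the variable names and the hyperparameters are only stand-ins chosen for illustration, not something from the answer itself:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Iris is only a stand-in dataset for illustration.
data = load_iris()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # one value per input column
ranking = np.argsort(importances)[::-1]    # column indices, most important first

for idx in ranking:
    print("%-20s %.3f" % (data.feature_names[idx], importances[idx]))
```

The same lines should work unchanged with GradientBoostingClassifier in place of RandomForestClassifier.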

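User

Answer 2 · edited 2024-09-22 16:39:05

The classifiers themselves do not record feature names; they only see numeric arrays. However, if you extracted your features with a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):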

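import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Print the 10 features with the highest coefficients for each class."""
    # get_feature_names() was removed in scikit-learn 1.2; call it instead on old versions.
    feature_names = vectorizer.get_feature_names_out()
    for i, class_label in enumerate(class_labels):
        # argsort is ascending, so the last 10 indices hold the largest coefficients.
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))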

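This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only, since coef_ then has a single row. You may have to sort class_labels so that they match the order of clf.classes_, which is the order of the rows in clf.coef_.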

User

Answer 3 · edited 2024-09-22 16:39:05

With the help of larsmans' code, I came up with this code for the binary case:

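def show_most_informative_features(vectorizer, clf, n=20):
    # get_feature_names() was removed in scikit-learn 1.2; call it instead on old versions.
    feature_names = vectorizer.get_feature_names_out()
    # Sort (coefficient, feature) pairs from most negative to most positive.
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Pair the n most negative features with the n most positive ones.
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))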
