The accuracy of sklearn's DecisionTreeClassifier seems to change when I change the order of the input columns. That should not be the case. What am I doing wrong?
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,1:], X_train[:,:1])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,2:], X_train[:,:2])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,3:], X_train[:,:3])), y_train)
print(clf.score(X_test, y_test))
Running this code produces the following output:
0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333
This question was asked three years ago, but the asker was downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?
EDIT

In the code above, I forgot to apply the column reordering to the test data. I found that the differing results persist even when the reordering is applied to the whole dataset.

First, I import the data and convert it to a DataFrame:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])
Then I select all the data by the feature names in their original order, and train and evaluate the model:
X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.7062937062937062
Why do I still get different results? Then I select the same columns in a different order to train and evaluate the model:
X = iris[iris_features[2:]+iris_features[:2]].values
print(X.shape[1], iris_features[2:]+iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.8881118881118881
You did not apply the column reordering to the test data (X_test). When you apply the same operation to the test data, you get the same score.

Update:
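A minimal sketch of that fix, reusing the first example's setup with test_size=0.90; the permutation list perm is my own illustrative choice, and the point is that it must be applied to both the training and the test columns:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.90, random_state=0)

# Baseline: original column order.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
base_score = clf.score(X_test, y_test)

# Reordered: move the first column to the end, in BOTH train and test.
perm = [1, 2, 3, 0]
clf2 = DecisionTreeClassifier(random_state=0)
clf2.fit(X_train[:, perm], y_train)
reordered_score = clf2.score(X_test[:, perm], y_test)

print(base_score, reordered_score)
```

With the reordering applied consistently, the reported scores line up again, as the answer describes.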
In the second example you set test_size to 0.95, which leaves only 7 training points, whose classes are array([0, 0, 0, 2, 1, 2, 0]).

If you measure the decision tree's training score in both cases, it is 1.0. This tells us that the model found an optimal separation of the training data in both cases.

So the short answer is yes: when different combinations of rules (different split conditions) can each separate the training points perfectly (100% accuracy), the results can differ when the column order changes.
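A quick check of that claim (my own sketch, with the column permutations chosen for illustration): with test_size=0.95, an unrestricted tree fits the 7 remaining training points perfectly under either column order.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# test_size=0.95 leaves only 7 of the 150 samples for training
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.95, random_state=0)
print(len(y_train), y_train)

train_scores = []
for perm in ([0, 1, 2, 3], [2, 3, 0, 1]):  # original and rotated order
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, perm], y_train)
    # Score on the training data itself, not the held-out test set.
    train_scores.append(clf.score(X_train[:, perm], y_train))
print(train_scores)
```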
Using plot_tree we can visualize the tree. For this we need to understand the DecisionTreeClassifier implementation. This answer quotes the key point from the documentation:

practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node

With a greedy algorithm, changing the column order can change its result. Conversely, when the dataset contains more data points (unlike in this example), changing the column order is unlikely to produce different results.
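To inspect the split conditions directly without plotting, a sketch using sklearn's export_text (plot_tree shows the same structure graphically but needs matplotlib); the two permutations here are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
names = iris['feature_names']
X_train, _, y_train, _ = train_test_split(
    iris['data'], iris['target'], test_size=0.95, random_state=0)

rules = []
for perm in ([0, 1, 2, 3], [2, 3, 0, 1]):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, perm], y_train)
    # Print the learned rules with the feature names in this column order.
    rules.append(export_text(clf, feature_names=[names[i] for i in perm]))
    print(rules[-1])
```

Comparing the two rule listings shows whether the greedy splitter picked different features once the columns were permuted.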
Even in this example, when we set test_size=0.90, we can also get the same score, 0.9407407407407408, in both cases.