Why does DecisionTreeClassifier (sklearn 0.23.1) give different results depending on the order of the input columns?



The accuracy seems to change when I change the order of the input columns for sklearn's DecisionTreeClassifier. That should not happen. What am I doing wrong?

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,1:], X_train[:,:1])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,2:], X_train[:,:2])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,3:], X_train[:,:3])), y_train)
print(clf.score(X_test, y_test))

Running this code produces the following output:

0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333

This question was asked three years ago, but the asker was downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?


EDIT

In the code above, I forgot to apply the column reordering to the test data.

I found that the different results persist even when the reordering is applied to the entire dataset.

First, I import the data and convert it to a DataFrame.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])

Then I select all of the data via the feature names in their original order, and I train and evaluate the model.

X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.7062937062937062

Then I select the same columns in a different order, and train and evaluate the model again. Why do I still get different results?

X = iris[iris_features[2:]+iris_features[:2]].values
print(X.shape[1], iris_features[2:]+iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.8881118881118881


1 Answer

You did not apply the column reordering to the test data (X_test). When you apply the same reordering to the test data as well, you get identical scores:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


def shuffle_data(data, n):
    # rotate the columns left by n positions: columns n.. first, then columns 0..n-1
    return np.hstack((data[:,n:], data[:,:n]))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,1), y_train)
print(clf.score(shuffle_data(X_test,1), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,2), y_train)
print(clf.score(shuffle_data(X_test,2), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,3), y_train)
print(clf.score(shuffle_data(X_test,3), y_test))
# 0.9407407407407408
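
One way to make the consistent reordering harder to forget is to wrap the fit and the evaluation together. A minimal sketch, reusing the variables from the block above (fit_and_score is a name introduced here for illustration):

def fit_and_score(n):
    # apply the SAME column rotation to the training and the test data
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(shuffle_data(X_train, n), y_train)
    return clf.score(shuffle_data(X_test, n), y_test)

for n in range(4):
    print(n, fit_and_score(n))
    # every rotation should print 0.9407407407407408, as above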

UPDATE:

In the second example you set test_size to 0.95, which leaves only 7 data points for training, with classes array([0, 0, 0, 2, 1, 2, 0]).

If you measure the training score of the decision tree in both cases, it is 1.0. This tells us that the model found a perfect separation of the training points in both cases.
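
This can be checked directly. A minimal self-contained sketch, assuming the same 0.95 split as in the second example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

for n in (0, 2):  # original order, and the order rotated by two columns
    X_rot = np.hstack((X_train[:, n:], X_train[:, :n]))
    clf = DecisionTreeClassifier(random_state=0).fit(X_rot, y_train)
    print(clf.score(X_rot, y_train))  # training score: 1.0 in both cases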

So the short answer is yes: the results can differ when the column order changes, whenever different combinations of rules (different split conditions) can each produce a perfect separation (100% accuracy) of the training points.

Using plot_tree we can visualize the trees. To see why this happens, we need to look at the implementation of DecisionTreeClassifier. This answer quotes the key point from the documentation:

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

The part to focus on is: "practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node". With a greedy algorithm, changing the column order can change the result.
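
A sketch of such a comparison, assuming the two classifiers from the second example were kept as clf_a (original column order) and clf_b (rotated order); both names are introduced here for illustration:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

feat_a = iris_features
feat_b = iris_features[2:] + iris_features[:2]

# text view of the learned split conditions
print(export_text(clf_a, feature_names=feat_a))
print(export_text(clf_b, feature_names=feat_b))

# side-by-side view of the two trees
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_tree(clf_a, feature_names=feat_a, filled=True, ax=axes[0])
plot_tree(clf_b, feature_names=feat_b, filled=True, ax=axes[1])
plt.show()

Both trees separate the 7 training points perfectly, but with different split conditions, which is why their scores on the test data differ.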

At the same time, when there are more data points in the training set (which is not the case in this example), changing the column order is much less likely to change the result.
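
One way to test this claim is to repeat the experiment with a larger training fraction and a few random column permutations. A self-contained sketch:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris['data'], iris['target']
# a 50/50 split leaves 75 training points instead of 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.RandomState(0)
for _ in range(5):
    perm = rng.permutation(X.shape[1])
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, perm], y_train)
    # with this much training data the score is much less sensitive to column order
    print(perm, clf.score(X_test[:, perm], y_test))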

Even in this example, setting test_size=0.90 gives the same score of 0.9407407407407408 for every column order.
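
For example, rerunning the second example with test_size=0.90 instead of 0.95 (a sketch; the expected score is the one quoted above):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
y = data['target']
iris_features = data['feature_names']
iris = pd.DataFrame(data['data'], columns=iris_features)

for cols in (iris_features, iris_features[2:] + iris_features[:2]):
    X = iris[cols].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # both orderings should print 0.9407407407407408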
