Why does DecisionTreeClassifier (sklearn 0.23.1) give different results depending on the order of the input columns?



The accuracy seems to change when I change the order of the input columns for sklearn's DecisionTreeClassifier. That should not happen. What am I doing wrong?

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,1:], X_train[:,:1])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,2:], X_train[:,:2])), y_train)
print(clf.score(X_test, y_test))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,3:], X_train[:,:3])), y_train)
print(clf.score(X_test, y_test))

Running this code produces the following output:

0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333

This question was asked three years ago, but the asker was downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?


EDIT

In the code above, I forgot to apply the column reordering to the test data.

I found that the different results persist even when the reordering is applied to the entire dataset.

First, I import the data and convert it to a DataFrame.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

iris = load_iris()
y = iris['target']
iris_features = iris['feature_names']
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])

Then I select all of the data via the feature names in their original order, and I train and evaluate the model.

X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.7062937062937062

Then I select the same columns in a different order, and train and evaluate the model again. Why do I still get different results?

X = iris[iris_features[2:]+iris_features[:2]].values
print(X.shape[1], iris_features[2:]+iris_features[:2])
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(np.mean(y_test == pred))
# 0.8881118881118881


1 Answer

You did not apply the column reordering to the test data (X_test). When you apply the same reordering to the test data as well, you get identical scores:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()

X = iris['data']
y = iris['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)


def shuffle_data(data, n):
    # rotate the columns left by n positions: columns n.. first, then columns 0..n-1
    return np.hstack((data[:,n:], data[:,:n]))

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,1), y_train)
print(clf.score(shuffle_data(X_test,1), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,2), y_train)
print(clf.score(shuffle_data(X_test,2), y_test))
# 0.9407407407407408

clf = DecisionTreeClassifier(random_state=0)
clf.fit(shuffle_data(X_train,3), y_train)
print(clf.score(shuffle_data(X_test,3), y_test))
# 0.9407407407407408
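
One way to make the consistent reordering harder to forget is to wrap the fit and the evaluation together. A minimal sketch, reusing the variables from the block above (fit_and_score is a name introduced here for illustration):

def fit_and_score(n):
    # apply the SAME column rotation to the training and the test data
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(shuffle_data(X_train, n), y_train)
    return clf.score(shuffle_data(X_test, n), y_test)

for n in range(4):
    print(n, fit_and_score(n))
    # every rotation should print 0.9407407407407408, as above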

UPDATE:

In the second example you set test_size to 0.95, which leaves only 7 data points for training, with classes array([0, 0, 0, 2, 1, 2, 0]).

If you measure the training score of the decision tree in both cases, it is 1.0. This tells us that the model found a perfect separation of the training points in both cases.
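
This can be checked directly. A minimal self-contained sketch, assuming the same 0.95 split as in the second example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris['data'], iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)

for n in (0, 2):  # original order, and the order rotated by two columns
    X_rot = np.hstack((X_train[:, n:], X_train[:, :n]))
    clf = DecisionTreeClassifier(random_state=0).fit(X_rot, y_train)
    print(clf.score(X_rot, y_train))  # training score: 1.0 in both cases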

So the short answer is yes: the results can differ when the column order changes, whenever different combinations of rules (different split conditions) can each produce a perfect separation (100% accuracy) of the training points.

Using plot_tree we can visualize the trees. To see why this happens, we need to look at the implementation of DecisionTreeClassifier. This answer quotes the key point from the documentation:

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

The part to focus on is: "practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node". With a greedy algorithm, changing the column order can change the result.
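
A sketch of such a comparison, assuming the two classifiers from the second example were kept as clf_a (original column order) and clf_b (rotated order); both names are introduced here for illustration:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree, export_text

feat_a = iris_features
feat_b = iris_features[2:] + iris_features[:2]

# text view of the learned split conditions
print(export_text(clf_a, feature_names=feat_a))
print(export_text(clf_b, feature_names=feat_b))

# side-by-side view of the two trees
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_tree(clf_a, feature_names=feat_a, filled=True, ax=axes[0])
plot_tree(clf_b, feature_names=feat_b, filled=True, ax=axes[1])
plt.show()

Both trees separate the 7 training points perfectly, but with different split conditions, which is why their scores on the test data differ.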

At the same time, when there are more data points in the training set (which is not the case in this example), changing the column order is much less likely to change the result.
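
One way to test this claim is to repeat the experiment with a larger training fraction and a few random column permutations. A self-contained sketch:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X, y = iris['data'], iris['target']
# a 50/50 split leaves 75 training points instead of 7
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.RandomState(0)
for _ in range(5):
    perm = rng.permutation(X.shape[1])
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, perm], y_train)
    # with this much training data the score is much less sensitive to column order
    print(perm, clf.score(X_test[:, perm], y_test))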

Even in this example, setting test_size=0.90 gives the same score of 0.9407407407407408 for every column order.
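
For example, rerunning the second example with test_size=0.90 instead of 0.95 (a sketch; the expected score is the one quoted above):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
y = data['target']
iris_features = data['feature_names']
iris = pd.DataFrame(data['data'], columns=iris_features)

for cols in (iris_features, iris_features[2:] + iris_features[:2]):
    X = iris[cols].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # both orderings should print 0.9407407407407408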
