<p>There are a few things wrong here.</p>
<p><code>X_train,y_train = oversample.fit_resample(X_train,y_train)</code></p>
<p>You cannot do this before cross-validation: by oversampling the full training set first, you are using information from samples that will later land in the validation folds.</p>
<p><code>X_train = scaler.fit_transform(X_train)</code></p>
<p>You cannot scale the whole dataset and then run cross-validation either. The mean and standard deviation are estimated from samples that will end up in the validation fold of each CV round, which leaks information. That is not right.</p>
<p>One way to implement it correctly is:</p>
<pre><code>import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

kf = KFold(n_splits=10)
acc = np.zeros(10)
k = 0
for train_index, test_index in kf.split(X_train):
    X_tr = X_train[train_index, :]
    y_tr = y_train[train_index]
    X_te = X_train[test_index, :]  # the validation fold comes from X_train, not X_test
    y_te = y_train[test_index]

    # Fit the scaler on the training fold only, then apply it to the validation fold
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_tr)
    X_te = scaler.transform(X_te)

    # Oversample the training fold only, after the split
    oversample = SMOTE()
    X_tr, y_tr = oversample.fit_resample(X_tr, y_tr)

    classifier = RandomForestClassifier(
        n_estimators=100,
        criterion='gini',
        max_depth=22,
        min_samples_split=2,
        min_samples_leaf=1,
        bootstrap=True,
        n_jobs=-1,
        random_state=0,
        class_weight='balanced'
    )
    classifier.fit(X_tr, y_tr)
    y_pr = classifier.predict(X_te)
    acc[k] = np.mean(y_te == y_pr)
    k += 1

print(np.mean(acc))
</code></pre>
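<p>A more compact alternative is to wrap the per-fold steps in a <code>Pipeline</code> and let <code>cross_val_score</code> refit everything inside each fold. This is a sketch using only scikit-learn (the toy data stands in for your <code>X_train</code>/<code>y_train</code>); to include SMOTE in the pipeline you would need <code>imblearn.pipeline.Pipeline</code> from imbalanced-learn instead of scikit-learn's, so that resampling is also applied only to each training fold.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy imbalanced data standing in for X_train / y_train
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# cross_val_score refits the whole pipeline (scaler + classifier) on the
# training portion of every fold, so no fold statistics leak into validation.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=100,
                                  class_weight='balanced',
                                  random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
print(scores.mean())
```

<p>With imbalanced-learn installed, add <code>('smote', SMOTE())</code> as a pipeline step before the classifier; the imblearn pipeline applies it during <code>fit</code> only, never to the validation fold.</p>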