Handling overfitting on an imbalanced dataset


I have an imbalanced dataset (only 0.06% of the rows are labelled 1, the rest are labelled 0). In the course of my work I had to resample the data, so I used the imblearn package's RandomUnderSampler on the training set. I then built a neural network with Keras Sequential. During training the F1 score climbs to around 75% (results at epoch 1000: loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472), but on the test set the results are disappointing (loss: 55.35181%, acc: 79.25248%, f1_m: 0.39789%, precision_m: 0.23259%, recall_m: 1.54982%).

My hypothesis is that on the training set, because the numbers of 1s and 0s are equal, both class weights get set to 1, so the network is not penalised much for mispredicting the 1s.
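
For reference, sklearn's 'balanced' heuristic computes weight = n_samples / (n_classes * count_per_class), so a 50/50 undersampled set always gets [1., 1.]. A minimal sketch with made-up label counts (not the real data):

import numpy as np
from sklearn.utils import class_weight

y_balanced = np.array([0] * 985 + [1] * 985)        # after undersampling
y_imbalanced = np.array([0] * 999400 + [1] * 600)   # roughly 0.06% positives

# 'balanced' weight = n_samples / (n_classes * count_per_class)
print(class_weight.compute_class_weight(class_weight='balanced',
                                        classes=np.array([0, 1]),
                                        y=y_balanced))      # -> [1. 1.]
print(class_weight.compute_class_weight(class_weight='balanced',
                                        classes=np.array([0, 1]),
                                        y=y_imbalanced))    # -> approx. [0.5, 833.3]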

I have tried techniques such as reducing the number of layers, reducing the number of neurons, and using regularization and dropout, but the F1 score on the test set never goes above 0.5%. What should I do? Thanks.

My neural network:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
from sklearn.utils import class_weight

def neural_network(X, y, epochs_count=3, handle_overfit=False):
    # create model; infer the input dimension from X itself
    # (the original used len(X_test.columns), which relies on a global X_test)
    model = Sequential()
    model.add(Dense(12, input_dim=X.shape[1], activation='relu'))
    if handle_overfit:
        model.add(Dropout(rate=0.5))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
    if handle_overfit:
        model.add(Dropout(rate=0.1))
    model.add(Dense(1, activation='sigmoid'))

    # compile the model; f1_m, precision_m and recall_m are custom metric
    # functions (a possible implementation is sketched below)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc', f1_m, precision_m, recall_m])

    # compute weights for classes '0' and '1' automatically ('balanced')
    class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                      classes=np.array([0, 1]),
                                                      y=np.ravel(y))
    print("---------------------- \n chosen class_weights are: ", class_weights, " \n ---------------------")

    # fit the model; Keras expects class_weight as a dict {class_index: weight}
    model.fit(X, y, epochs=epochs_count, batch_size=512,
              class_weight=dict(enumerate(class_weights)))

    return model
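
The custom metrics f1_m, precision_m and recall_m are not shown in the question; a minimal sketch of the usual Keras-backend implementations they presumably follow (an assumption, not the author's exact code). Note that Keras computes such metrics per batch and averages them, which matters when reading the test-set numbers further down:

from keras import backend as K

def recall_m(y_true, y_pred):
    # true positives / all actual positives, computed per batch
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision_m(y_true, y_pred):
    # true positives / all predicted positives, computed per batch
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_m(y_true, y_pred):
    # harmonic mean of per-batch precision and recall
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * (precision * recall) / (precision + recall + K.epsilon())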

Defining the train and test sets:

from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

train_set, test_set = train_test_split(data, test_size=0.35, random_state=0)

X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]
X_test = test_set[['..... some columns ....']]
y_test = test_set[['success']]

print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
# fit_sample was renamed to fit_resample in newer imblearn releases
X_undersampled, y_undersampled = rus.fit_sample(X_train, y_train)
print('undersampled dataset shape: ', X_undersampled.shape)

The output is:

Initial dataset shape:  (1625843, 11)
undersampled dataset shape:  (1970, 11)
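
That drop from 1.6M rows to 1,970 is just the undersampler equalising the two classes; a quick back-of-the-envelope check (the positive count of 985 is inferred from the shapes above, not stated explicitly in the question):

n_rows = 1625843            # training rows before undersampling
n_pos = 1970 // 2           # RandomUnderSampler keeps all positives: 985
print(round(100 * n_pos / n_rows, 3))   # ~0.061, i.e. roughly the 0.06% quoted
print(2 * n_pos)                        # 1970 rows after undersampling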

Finally, the call to the neural network:

print (X_undersampled.shape, y_undersampled.shape)
print (X_test.shape, y_test.shape)

model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)

# evaluate the model
print("\n---------------\nEvaluated on test set:")

scores = model.evaluate(X_test, y_test)
for i in range(len(model.metrics_names)):
    print("%s: %.5f%%" % (model.metrics_names[i], scores[i]*100))

The output is:

(1970, 11) (1970,)
(875454, 11) (875454, 1)
---------------------- 
 chosen class_weights are:  [1. 1.]  
 ---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859

.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472

---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%
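
One caveat when reading these test-set numbers: f1_m, precision_m and recall_m are computed per batch and then averaged, which can distort precision and recall on a test set this imbalanced. A sketch of a cross-check over the full predictions with scikit-learn, using the same implicit 0.5 threshold as the accuracy metric (assumes the model and variables defined above):

import numpy as np
from sklearn.metrics import classification_report

# predict probabilities on the whole test set, then threshold at 0.5
y_prob = model.predict(X_test, batch_size=512)
y_pred = (y_prob > 0.5).astype(int)

print(classification_report(np.ravel(y_test), np.ravel(y_pred), digits=4))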

Tags: test, add, model, train, class, precision, f1, acc