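I have an imbalanced dataset (only 0.06% of the rows are labeled 1; the rest are labeled 0). From what I read, I had to resample the data, so I used the imblearn package to apply RandomUnderSampler to my dataset. Then I built a neural network with Keras Sequential. During training, the F1 score rose to about 75% (result at epoch 1000: loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472), but on the test set the result was disappointing (loss: 55.35181%, acc: 79.25248%, f1_m: 0.39789%, precision_m: 0.23259%, recall_m: 1.54982%).
My hypothesis is that in the training set the numbers of 1s and 0s are equal, so both class weights are set to 1, and the network is therefore not penalized much for wrongly predicting 1.
I tried several techniques, such as reducing the number of layers, reducing the number of neurons, and using regularization and dropout, but the test-set F1 score never goes above 0.5%. What should I do? Thanks.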
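A quick back-of-the-envelope check of how much the class ratio alone can lower test precision. This is only a sketch: `expected_precision` is a hypothetical helper, and the recall (0.75) and false-positive rate (0.21) plugged in are illustrative assumptions loosely based on the training recall and the test accuracy quoted above, not measured values.

```python
def expected_precision(prevalence, recall, false_positive_rate):
    """Precision implied by Bayes' rule for a given share of positive rows."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Balanced (undersampled) data: half of the rows are positive.
balanced = expected_precision(0.5, recall=0.75, false_positive_rate=0.21)

# Original distribution: about 0.06% of the rows are positive.
imbalanced = expected_precision(0.0006, recall=0.75, false_positive_rate=0.21)

print(f"precision on balanced data:   {balanced:.4f}")
print(f"precision on imbalanced data: {imbalanced:.4f}")
```

Under these assumed rates, the same classifier drops from roughly 78% precision on balanced data to roughly 0.2% on the original distribution, which is the same order of magnitude as the test precision reported above. If that reading is right, the gap may come mostly from the change in class ratio between the training and test sets rather than from classic overfitting.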
My neural network:
import numpy as np
from keras import regularizers
from keras.layers import Dense, Dropout
from keras.models import Sequential
from sklearn.utils import class_weight

def neural_network(X, y, epochs_count=3, handle_overfit=False):
    # create model; the input width comes from the data passed in, not from the global X_test
    model = Sequential()
    model.add(Dense(12, input_dim=X.shape[1], activation='relu'))
    if handle_overfit:
        model.add(Dropout(rate=0.5))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
    if handle_overfit:
        model.add(Dropout(rate=0.1))
    model.add(Dense(1, activation='sigmoid'))
    # compile the model (f1_m, precision_m and recall_m are custom metrics defined elsewhere)
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc', f1_m, precision_m, recall_m])
    # compute 'balanced' weights for the classes '0' and '1'
    weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=np.ravel(y))
    print("---------------------- \n chosen class_wieghts are: ", weights, " \n ---------------------")
    # Keras expects class_weight as a dict mapping class index to weight
    class_weights = dict(enumerate(weights))
    # Fit the model
    model.fit(X, y, epochs=epochs_count, batch_size=512, class_weight=class_weights)
    return model
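To see why the 'balanced' option ends up with weights of 1 after undersampling, here is a small stand-alone sketch of the formula scikit-learn documents for it, n_samples / (n_classes * count_of_class). `balanced_class_weights` is a hypothetical re-implementation for illustration, and the 985/985 split is an assumption derived from the 1970 undersampled rows.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of rows in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}


# Assumed undersampled set: 985 rows of each class (1970 rows in total).
undersampled_weights = balanced_class_weights([0] * 985 + [1] * 985)

# A toy set with the original ratio: 6 positives in 10,000 rows (0.06%).
original_like_weights = balanced_class_weights([0] * 9994 + [1] * 6)

print(undersampled_weights)
print(original_like_weights)
```

On already balanced data the formula returns 1.0 for both classes, so passing `class_weight` changes nothing. On data with the original ratio, the positive class would instead receive a weight several hundred times larger than the negative class.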
Defining the training and test sets:
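from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.35, random_state=0)
X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]
X_test = test_set[['..... some columns ....']]
y_test = test_set[['success']]
print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
# fit_resample is the current name of fit_sample, which newer imblearn releases no longer provide
X_undersampled, y_undersampled = rus.fit_resample(X_train, y_train)
print('undersampled dataset shape: ', X_undersampled.shape)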
The output is:
Initial dataset shape: (1625843, 11)
undersampled dataset shape: (1970, 11)
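For readers unfamiliar with random undersampling, the shapes above can be read as follows: every minority row is kept and an equally sized random subset of the majority rows is drawn. The sketch below illustrates that idea in plain Python on toy data; `random_undersample` is a hypothetical helper, not the imblearn implementation.

```python
import random


def random_undersample(rows, labels, seed=42):
    """Keep as many rows of every class as the smallest class has, chosen at random."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    minority_size = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, minority_size))
    kept.sort()  # restore the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data with the same ratio as the post: 6 positives in 10,000 rows.
labels = [1] * 6 + [0] * 9994
rows = list(range(len(labels)))
X_small, y_small = random_undersample(rows, labels)
print(len(y_small), sum(y_small))
```

Applied to the real data, this means the network is trained on only 1,970 of the 1,625,843 training rows, so nearly all negative examples are discarded before training.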
Finally, the call to the neural network:
print(X_undersampled.shape, y_undersampled.shape)
print(X_test.shape, y_test.shape)
model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)
# evaluate the model
print("\n---------------\nEvaluated on test set:")
scores = model.evaluate(X_test, y_test)
for i in range(len(model.metrics_names)):
    print("%s: %.5f%%" % (model.metrics_names[i], scores[i] * 100))
The output is:
(1970, 11) (1970,)
(875454, 11) (875454, 1)
----------------------
chosen class_wieghts are: [1. 1.]
---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859
.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472
---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%
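One caveat when reading these test numbers: the definitions of f1_m, precision_m and recall_m are not shown. If they are the common Keras-backend versions that are computed per batch and then averaged (an assumption), then `model.evaluate`, which uses a batch size of 32 by default, averages over many batches that contain no positive row at all, and each of those contributes a recall of 0. The sketch below uses toy data and hypothetical helper names to show how far a batch-averaged value can sit from the value computed over the whole set.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels, computed over all rows at once."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def batch_averaged_recall(y_true, y_pred, batch_size):
    """Mean of per-batch recall values; a batch without positives scores 0."""
    values = []
    for start in range(0, len(y_true), batch_size):
        stop = start + batch_size
        _, recall, _ = precision_recall_f1(y_true[start:stop], y_pred[start:stop])
        values.append(recall)
    return sum(values) / len(values)


# Toy data (not from the post): 1000 rows, 2 positives, perfect predictions.
y_true = [0] * 1000
y_true[5] = 1
y_true[505] = 1
y_pred = list(y_true)

whole_set = precision_recall_f1(y_true, y_pred)
per_batch = batch_averaged_recall(y_true, y_pred, batch_size=100)
print("whole-set recall:", whole_set[1])
print("batch-averaged recall:", per_batch)
```

Here a perfect classifier has a whole-set recall of 1.0, but the batch-averaged value is only 0.2, because 8 of the 10 batches contain no positive row. For the real model, whole-set numbers could be obtained by thresholding the output of `model.predict(X_test)` and passing the result to scikit-learn's `precision_recall_fscore_support`.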