Machine learning: classifying imbalanced data

id,claimcst0,veh_value,exposure,veh_body,veh_age,gender,area,agecat,clm,numclaims
1,0,6.43,0.241897754,STNWG,1,M,A,3,0,0
2,0,4.46,0.856522757,STNWG,1,M,A,3,0,0
3,0,1.7,0.417516596,HBACK,1,M,A,4,0,0
4,0,0.48,0.626974524,SEDAN,4,F,A,6,0,0
5,0,1.96,0.089770031,HBACK,1,F,A,2,0,0
6,0,1.78,0.25654335,HBACK,2,M,A,3,0,0
7,0,2.7,0.688128611,UTE,2,M,A,1,0,0
8,0,0.94,0.912765859,STNWG,4,M,A,2,0,0
9,0,1.98,0.157753423,SEDAN,2,M,A,4,0,0

3 answers

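User

#1 · edited 2024-09-30 00:33:09

You can set the class_weight parameter to handle an imbalanced dataset. In this case, label 1 makes up only 8% of the data, so you would give that label a higher weight when training the classifier.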

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

class_weight : {dict, ‘balanced’}, optional Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
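To make the "balanced" formula quoted above concrete, here is a minimal sketch that computes n_samples / (n_classes * count) by hand. The helper name balanced_class_weights and the 100-row label list are made up for illustration; only the formula comes from the quoted documentation.

```python
from collections import Counter


def balanced_class_weights(y):
    """Return {label: n_samples / (n_classes * count_of_label)}.

    Hypothetical helper that mirrors the "balanced" formula quoted above.
    """
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Assumed toy labels: 8 positives out of 100, matching the 8% figure.
y = [1] * 8 + [0] * 92
weights = balanced_class_weights(y)

# The rare class gets the larger weight: 100 / (2 * 8) = 6.25.
print(weights[1])
```

A dict like this can be passed as class_weight to SVC, or you can pass class_weight='balanced' and let scikit-learn apply the same formula to the training labels.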

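User

#2 · edited 2024-09-30 00:33:09

This is a fairly common challenge: your two classes are imbalanced. To keep the model from predicting only one class, you need to train on a balanced set. There are several ways to get one; the most basic is to sample the same number of rows from each class. Since you have 1500 samples with label 1, you should also draw 1500 samples with label 0: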

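import pandas as pd

n = 1500
# Draw n rows per class without replacement (.loc replaces .ix, which was removed in pandas 1.0)
sample_yes = data.loc[data.y == 1].sample(n=n, replace=False, random_state=0)
sample_no = data.loc[data.y == 0].sample(n=n, replace=False, random_state=0)
df = pd.concat([sample_yes, sample_no])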

Here data is the original DataFrame. You should do this before splitting the data into training and test sets.

User

#3 · edited 2024-09-30 00:33:09

I had this same problem, if not a worse case of it. One solution I found is to oversample the 1s, as described here:

http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/

https://yiminwu.wordpress.com/2013/12/03/how-to-undo-oversampling-explained/
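For anyone who wants to see the idea in code, here is a rough sketch of random oversampling followed by the usual prior correction for the predicted probabilities. It is not taken from the linked posts. The function names, the toy labels, and the 8% rate (borrowed from answer #1) are assumptions for illustration.

```python
import random


def oversample_minority(rows, labels, minority=1, seed=0):
    """Randomly duplicate minority rows until both classes have equal size.

    Hypothetical helper; assumes two classes and at least one minority row.
    """
    rng = random.Random(seed)
    pairs = list(zip(rows, labels))
    minority_pairs = [p for p in pairs if p[1] == minority]
    majority_pairs = [p for p in pairs if p[1] != minority]
    n_extra = max(0, len(majority_pairs) - len(minority_pairs))
    # Sample with replacement from the minority class.
    extra = rng.choices(minority_pairs, k=n_extra)
    balanced = pairs + extra
    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


def undo_oversampling(p, original_rate, oversampled_rate):
    """Rescale a probability from a model trained on oversampled data
    back to the original class prior (standard odds-ratio correction)."""
    if p <= 0:
        return 0.0
    ratio = (1 / original_rate - 1) / (1 / oversampled_rate - 1)
    return 1 / (1 + ratio * (1 / p - 1))


# Assumed toy data: 8 positives out of 100 rows.
rows = list(range(100))
labels = [1] * 8 + [0] * 92
new_rows, new_labels = oversample_minority(rows, labels)
print(new_labels.count(1), new_labels.count(0))

# A score of 0.5 on the 50/50 data corresponds to the 8% base rate.
adjusted = undo_oversampling(0.5, original_rate=0.08, oversampled_rate=0.5)
```

Because oversampling duplicates rows, it is usually applied to the training split only, so that copies of the same row do not end up in the test set. The correction step matters only if you need calibrated probabilities; the ranking of the scores is unchanged.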
