How to use stratify with a single column

Posted 2024-06-15 01:06:09

I am new to this, so I am not sure how best to phrase my question. I have tried to state it as simply as possible. Here is part of my code:

print(data)

Output:

array([[0, 0, 0, ..., 255, 255, 255],
       [255, 255, 255, ..., 0, 0, 0],
       [255, 255, 255, ..., 255, 255, 255],
       ...,
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255],
       [255, 255, 255, ..., 255, 255, 255]], dtype=object)

print(result)

Output:

['Arrowhead' 'Arrowhead' 'Arrowhead' ... 'Vessel' 'Vessel' 'Vessel']

Converting the labels to numbers:

from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
target = LE.fit_transform(result)

print(target)

Output:

[ 0  0  0 ... 38 38 38]

Splitting:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42, stratify=target)

I get an error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

To get rid of the error I had to remove stratify, which is probably acceptable for now:

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

To build a CNN, I have to do this:

lb = preprocessing.LabelBinarizer()

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

print(y_train_categorical.shape)
print(y_test_categorical.shape)

Output:

(1945, 38)
(487, 34)

Here is the problem. I need the second dimension to be the same for both arrays (y_train_categorical.shape[1] and y_test_categorical.shape[1]), because I use this model:

model = Sequential()

model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(100,100,1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(38, activation='softmax'))

This works for model.fit():

model.fit(X_train, y_train_categorical, 
          batch_size=32, epochs=5, verbose=1)

But when evaluating on the test set:

loss, accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)
print('Loss: ', loss,'\nAcc: ', accuracy)

I get this error:

ValueError: Error when checking target: expected dense_2 to have shape (38,) but got array with shape (34,)

How can I make y_train_categorical.shape[1] and y_test_categorical.shape[1] the same, or is there a simple way to resolve my last error (when evaluating the model on the test set)?


2 Answers

Solution for the error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

The error says that one class in the target variable occurs only once. To illustrate, consider the following example:

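from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

random_list = ['a','a','a','b','b','c','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list)
print(target)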

This prints:

[0 0 0 1 1 2 3 3 4 4 4]

Now, if I try to run train_test_split, it throws an error:

train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

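This happens because 'c' occurs only once, which makes it ambiguous whether to put it in the training set or the test set when stratify is used. For this to work, every class needs more than one occurrence.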
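
One way to find the offending classes before splitting is to count how often each label occurs. The sketch below does this with np.unique on the toy list from above. The names rare_classes, keep_mask and filtered_target are illustrative and not part of the original post, and dropping the rare samples is only one possible workaround, assumed here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

random_list = ['a', 'a', 'a', 'b', 'b', 'c', 'd', 'd', 'e', 'e', 'e']
target = LabelEncoder().fit_transform(random_list)

# Count how many samples each encoded class has.
classes, counts = np.unique(target, return_counts=True)

# Classes with fewer than 2 members cannot be stratified.
rare_classes = classes[counts < 2]
print(rare_classes)  # the encoded label of 'c'

# One possible workaround (an assumption, not the only option):
# drop the samples of rare classes before a stratified split.
keep_mask = ~np.isin(target, rare_classes)
filtered_target = target[keep_mask]
print(len(filtered_target))
```

Whether dropping, merging, or collecting more samples for the rare classes is appropriate depends on the data set.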

An additional error in the example above

Even if I remove 'c' from the list above, the split still does not work. We hit another error:

random_list = ['a','a','a','b','b','d','d','e','e','e']
LE = LabelEncoder()
target = LE.fit_transform(random_list) #produces array([0, 0, 0, 1, 1, 2, 2, 3, 3, 3])
train_test_split(target, test_size=0.2, stratify=target)
#ValueError: The test_size = 2 should be greater or equal to the number of classes = 4

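For stratification to work, every class must appear in both the training set and the test set. If there are not enough data points to produce such a distribution, the error above is raised. With a test set of 2 samples, at most 2 classes can be stratified.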
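
As a rough illustration of the condition above, here is a minimal sketch with a made-up target of 4 classes and 5 samples each. With test_size=0.2 the test set has 4 samples, which is enough to hold one sample of every class, so the stratified split goes through. The data is hypothetical and only meant to show the size requirement.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target: 4 classes with 5 samples each (20 samples in total).
target = np.repeat([0, 1, 2, 3], 5)

# test_size=0.2 gives 4 test samples, which is enough for 4 classes.
y_train, y_test = train_test_split(
    target, test_size=0.2, random_state=42, stratify=target
)

# Every class shows up on both sides of the split.
print(np.unique(y_train))
print(np.unique(y_test))
```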

In general, irrespective of the errors and methodologically speaking, this:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.fit_transform(y_test)

is wrong: we never fit preprocessing steps on the test set. We reuse the transformation fitted on the training set, i.e.:

y_train_categorical = lb.fit_transform(y_train)
y_test_categorical = lb.transform(y_test) # transform only

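This also resolves your error, provided all labels in the test set are present in the training set, which should be the case in a well-formed predictive ML problem (otherwise the problem itself is ill-defined).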
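
A minimal sketch of the difference, using made-up labels in which the test set is missing one class (mirroring the 38 versus 34 columns in the question). Fitting only on the training labels keeps the number of columns identical for both arrays. The toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels: the test set lacks class 2.
y_train = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_test = np.array([0, 1, 3])

lb = LabelBinarizer()
y_train_categorical = lb.fit_transform(y_train)  # learn the classes here
y_test_categorical = lb.transform(y_test)        # reuse them here

print(y_train_categorical.shape)  # (8, 4)
print(y_test_categorical.shape)   # (3, 4)
```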

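If lb.transform(y_test) raises an error saying it encountered labels that were not previously present (and encoded), it means there are new, unseen labels in your test set. That is the real problem you have to fix here, not some coding error.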
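
A quick way to check for such unseen labels directly, without relying on an encoder to complain, is a set difference between the test and training labels. This is a small sketch on made-up arrays. The name unseen is illustrative.

```python
import numpy as np

# Hypothetical labels: 5 appears in the test set only.
y_train = np.array([0, 1, 2, 0, 1, 2])
y_test = np.array([0, 2, 5])

# Labels present in y_test but absent from y_train.
unseen = np.setdiff1d(y_test, y_train)
print(unseen)
```

An empty result means every test label was seen during training.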
