How do I split the data for imbalanced multi-label image classification?

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    smote = SMOTE()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

1 answer

User

#1 · Posted 2024-06-28 13:38:11

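Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets with proportional representation, and resampling methods for dealing with class imbalance. I will focus only on the splitting part, since that is what the title asks about.

I would use a stratified shuffle split, so that every class is represented in each subset in the same proportion as in the full dataset. Here is a handy visualization of stratified sampling from Wikipedia: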

Stratified Sampling example. Source: Wikipedia
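
To make the idea concrete, here is a minimal pure-Python sketch of a stratified shuffle split for the simpler single-label case. It is only an illustration of the concept, not the algorithm skmultilearn uses for multi-label data; the function name `stratified_shuffle_split` and the toy "cat"/"dog" labels are invented for this example.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx), splitting each class at roughly test_fraction."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # shuffle within the class only
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced data: 90 "cat" samples and 10 "dog" samples.
toy_labels = ["cat"] * 90 + ["dog"] * 10
train_idx, test_idx = stratified_shuffle_split(toy_labels, test_fraction=0.3)
```

Because the split is done per class, the rare "dog" class keeps its 10% share in both subsets instead of being over- or under-represented by chance. With multi-label data a sample belongs to several classes at once, so this per-class grouping no longer works directly, which is why a dedicated multi-label stratifier is needed.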

For this I recommend the IterativeStratification method from skmultilearn. It supports multi-label datasets.

    from skmultilearn.model_selection.iterative_stratification import IterativeStratification

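    # Assumed value, matching the 70/30 split in the question.
    # img_urls and labels are expected to be NumPy arrays defined by the caller.
    train_fraction = 0.7

    stratifier = IterativeStratification(
        n_splits=2, order=2, sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
    )
    # split() is a generator that yields k folds; we only take the first one to make a single static split.
    # NOTE: the stratification needs to be computed on hard (0/1) labels, not soft scores.
    train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))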

    # s3url array shape (N_samp,)
    x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
    # labels array shape (N_samp, n_classes)
    Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
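
After splitting, it can be worth checking that each label really does appear at a similar rate in both subsets. The sketch below is one possible way to do that in plain Python; the helper names `label_prevalence` and `max_prevalence_gap` and the toy matrices are hypothetical and not part of skmultilearn. It assumes hard 0/1 label rows shaped like `Y_train` and `Y_else` above.

```python
def label_prevalence(label_rows):
    """Fraction of samples carrying each label, for binary rows of equal length."""
    rows = [list(row) for row in label_rows]
    if not rows:
        return []
    n_classes = len(rows[0])
    return [sum(row[c] for row in rows) / len(rows) for c in range(n_classes)]


def max_prevalence_gap(rows_a, rows_b):
    """Largest absolute per-label difference in prevalence between two subsets."""
    prev_a = label_prevalence(rows_a)
    prev_b = label_prevalence(rows_b)
    return max((abs(a - b) for a, b in zip(prev_a, prev_b)), default=0.0)


# Toy hard-label matrices standing in for Y_train and Y_else.
toy_train = [[1, 0], [1, 1], [0, 0], [1, 0]]
toy_else = [[1, 0], [0, 1]]
gap = max_prevalence_gap(toy_train, toy_else)
```

A small gap suggests the split preserved the label distribution; a large gap on a rare label is a hint that the stratification did not behave as expected. What counts as "small" depends on the dataset size, so any threshold is a judgment call.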

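I wrote up a more complete solution, including unit tests, in a blog post.

One drawback of skmultilearn is that it is not well maintained and some of its functionality is broken. I documented some of the sharp edges and pitfalls in my blog post. Also note that this stratification procedure becomes very slow once you reach a few million images, because the stratifier uses only a single CPU.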
