Stratified sklearn train/test split on multiple pandas columns

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['b']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['c']])
print(len(train.a.values))  # prints 800000
print(len(set(train.a.values)))  # prints 800000

3 Answers

User

#1 · edited 2024-05-20 15:02:27

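You get duplicates because train_test_split() ultimately defines the strata as the set of unique values of whatever you pass to the stratify argument. Because the strata here are defined by two columns, a single row can belong to more than one stratum, so the sampler may pick the same row twice, believing it is sampling from different classes.

train_test_split() delegates to StratifiedShuffleSplit, which calls np.unique() on y (the value passed in via stratify). From the source code: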

classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]
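To see what that call does with two label columns, here is a minimal sketch. The array `y` is a made-up stand-in for `df[['b', 'c']].values`, not data from the question. Without an `axis` argument, `np.unique` flattens its input, so the two columns collapse into a single pool of labels.

```python
import numpy as np

# Hypothetical 2-column label array standing in for df[['b', 'c']].values
y = np.array([
    ["bar", "y"],
    ["foo", "y"],
    ["bar", "z"],
    ["foo", "z"],
])

# With no axis argument, np.unique works on the flattened array
classes, y_indices = np.unique(y, return_inverse=True)
n_classes = classes.shape[0]

print(classes)         # ['bar' 'foo' 'y' 'z']
print(n_classes)       # 4
print(y_indices.size)  # 8: one class index per cell, not per row
```

Four rows produce eight class indices, which is how one row can end up counted in two strata.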

Here is a simplified example, a variation on the one you provided:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

N = 20
a = np.arange(N)
b = np.random.choice(["foo","bar"], size=N)
c = np.random.choice(["y","z"], size=N)
df = pd.DataFrame({'a':a, 'b':b, 'c':c})

print(df)
     a    b  c
0    0  bar  y
1    1  foo  y
2    2  bar  z
3    3  bar  y
4    4  foo  z
5    5  bar  y
...

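The stratification function thinks there are four classes to split on: foo, bar, y, and z. But these classes are essentially nested, meaning y and z both occur within b == foo and within b == bar, so we get duplicates when the splitter tries to sample from each class.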

train, test = train_test_split(df, test_size=0.2, random_state=0, 
                               stratify=df[['b', 'c']])
print(len(train.a.values))  # 16
print(len(set(train.a.values)))  # 12

print(train)
     a    b  c
3    3  bar  y   # selecting a = 3 for b = bar*
5    5  bar  y
13  13  foo  y
4    4  foo  z
14  14  bar  z
10  10  foo  z
3    3  bar  y   # selecting a = 3 for c = y
6    6  bar  y
16  16  foo  y
18  18  bar  z
6    6  bar  y
8    8  foo  y
18  18  bar  z
7    7  bar  z
4    4  foo  z
19  19  bar  y

#* We can't be sure which row is selecting for `bar` or `y`, 
#  I'm just illustrating the idea here.

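There is a larger design question here: do you want nested stratified sampling, or do you actually just want to treat each class in df.b and df.c as a separate class to sample from? If the latter, that is what you are already getting. The former is more complicated, and it is not what train_test_split was set up to do.

You might find this discussion of nested stratified sampling useful.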
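One way to make the two options concrete is to count the strata each one implies. The sketch below uses a small hand-written frame (hypothetical data, not the question's `df`) and plain pandas grouping. Nested strata are the (b, c) combinations, while separate strata are the classes of each column taken on its own.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the example above
df = pd.DataFrame({
    "a": range(8),
    "b": ["foo", "foo", "bar", "bar", "foo", "bar", "foo", "bar"],
    "c": ["y", "z", "y", "z", "y", "y", "z", "z"],
})

# Nested strata: one group per (b, c) combination
nested_counts = df.groupby(["b", "c"]).size()

# Separate strata: each column's classes counted independently
b_counts = df["b"].value_counts()
c_counts = df["c"].value_counts()

print(nested_counts)
print(b_counts)
print(c_counts)
```

Checking the nested counts first is worthwhile, because a stratified split needs at least two rows in every stratum.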

User
#2 · edited 2024-05-20 15:02:27

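If you want train_test_split to behave as you expected (stratify by multiple columns with no duplicates), create a new column that is the concatenation of the values in the other columns, and stratify on the new column.

df['bc'] = df['b'].astype(str) + df['c'].astype(str)
train, test = train_test_split(df, test_size=0.2, random_state=0, stratify=df[['bc']])

If you are worried about collisions, for example the values 11 and 3 and the values 1 and 13 both producing the concatenated value 113, you can put an arbitrary separator string in the middle:

df['bc'] = df['b'].astype(str) + "_" + df['c'].astype(str)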
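The concatenation approach can be checked end to end with a small sketch. The data here is constructed for illustration (40 rows, exactly 10 per (b, c) combination) so the expected counts are easy to verify by hand. It is not the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 40 rows, 10 rows for each of the 4 (b, c) combinations
N = 40
df = pd.DataFrame({
    "a": np.arange(N),
    "b": np.tile(["foo", "bar"], N // 2),
    "c": np.tile(["y", "y", "z", "z"], N // 4),
})

# Combined key with a separator so distinct pairs cannot collide
df["bc"] = df["b"].astype(str) + "_" + df["c"].astype(str)

train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["bc"]
)

print(len(train), train["a"].is_unique)  # 32 True
print(train["bc"].value_counts())        # 8 rows per combination
print(test["bc"].value_counts())         # 2 rows per combination
```

Each combined key is a single class, so every row belongs to exactly one stratum and no row can be drawn twice.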

User
#3 · edited 2024-05-20 15:02:27

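Which version of scikit-learn are you using? You can check with sklearn.__version__.

In versions before 0.19.0, scikit-learn did not handle 2-dimensional stratification correctly. This was patched in 0.19.0.

It is described in issue #9044.

Updating scikit-learn should fix the problem. If you cannot update scikit-learn, see the commit history here for the fix.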
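A quick way to run that check in code is sketched below. The variable name `has_2d_stratify_fix` is made up for illustration, and the 0.19 threshold is taken from this answer rather than verified independently.

```python
import sklearn

# Compare only the major and minor components of the version string;
# the 0.19 threshold is the one quoted in the answer above.
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
has_2d_stratify_fix = (major, minor) >= (0, 19)

print(sklearn.__version__, has_2d_stratify_fix)
```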
