Splitting data into train and test sets by observation nam

             136       137       138       139  141  143  144  145       146  \
Sample
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.140901
HC10    0.000000  0.000000  0.000000  0.267913  0.0  0.0  0.0  0.0  0.000000
HC10    0.000000  0.000000  0.000000  0.000000  0.0  0.0  0.0  0.0  0.174445
HC11    0.059915  0.212442  0.255549  0.000000  0.0  0.0  0.0  0.0  0.000000
HC11    0.000000  0.115988  0.144056  0.070028  0.0  0.0  0.0  0.0  0.000000

        147       148  149       150  151       152      154       156  158  \
Sample
HC10    0.0  0.189937  0.0  0.052635  0.0  0.148751  0.00000  0.000000  0.0
HC10    0.0  0.000000  0.0  0.267764  0.0  0.000000  0.00000  0.000000  0.0
HC10    0.0  0.208134  0.0  0.130212  0.0  0.165507  0.00000  0.000000  0.0
HC11    0.0  0.000000  0.0  0.000000  0.0  0.000000  0.06991  0.102209  0.0
HC11    0.0  0.065779  0.0  0.072278  0.0  0.060815  0.00000  0.060494  0.0

             160  173
Sample
HC10    0.051911  0.0
HC10    0.281227  0.0
HC10    0.000000  0.0
HC11    0.000000  0.0
HC11    0.073956  0.0

1 Answer

User

#1 · Posted 2024-10-01 13:45:03

You can sample within each group so that the split stays balanced across groups. I'll modify your small example:

import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10),
})

train = (
    df.reset_index()                      # keep the original index as a column
    .groupby('group')                     # split by "group"
    .apply(lambda g: g.sample(frac=0.6))  # within each group, draw a random 60% sample
    .reset_index(drop=True)               # drop the (group, row) index that apply adds
    .set_index('index')                   # restore the original index
)
test = df.drop(train.index)               # the rows not drawn for train form the test set
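
On pandas 1.1 or later, the same per-group split can probably be written more directly with `DataFrameGroupBy.sample`, which keeps the original index and so avoids the index bookkeeping. This is only a sketch on the toy frame from above; `random_state=0` is an assumption added here for reproducibility and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10),
})

# Draw 60% of the rows within each group; the original row index is preserved.
# random_state is an assumption, added only so the split is reproducible.
train = df.groupby('group').sample(frac=0.6, random_state=0)

# Everything that was not drawn for train becomes the test set.
test = df.drop(train.index)
```

Because the sampling is done per group, both `a` and `b` appear in `train` in roughly the same proportions as in `df`.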

Another solution is to use a stratified sampling algorithm, for example the one in scikit-learn.
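
As a rough sketch of the scikit-learn route, `train_test_split` accepts a `stratify` argument that keeps the group proportions similar in both parts. The toy frame, the 60% training share, and `random_state=0` are assumptions chosen to mirror the pandas example, not something the answer specifies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'x': range(10),
})

# Stratify on the "group" column so each group is split in about the same ratio.
# train_size and random_state are illustrative assumptions.
train, test = train_test_split(
    df,
    train_size=0.6,
    stratify=df['group'],
    random_state=0,
)
```

Both returned frames keep the original index labels, so rows can still be traced back to `df`.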
