Correlation coefficient of the full dataset is larger than that of its subsamples

Published 2024-05-21 06:49:24


I built two classifiers, a boosted decision tree (BDT) and a neural network (NN), to classify events as belonging to either a signal class or a background class. Each outputs a continuous probability between 0 and 1 of belonging to the signal class. I want to compare the two methods and find the correlation between them.

However, I find that if I compute the correlation coefficient only over the background events, or only over the signal events, each of these correlations is smaller than the correlation over the whole dataset. Since both classifiers are tested on exactly the same dataset, I assumed the total correlation would be a weighted average of the two per-class correlations. Note that the full dataset consists of about 100,000 events.
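As an aside, the weighted-average assumption does not hold for Pearson correlation in general. A minimal sketch with synthetic data (not the OP's events, just two seeded Gaussian clouds) shows that pooling two groups that occupy different regions of the plane can push the overall correlation well above both group-level values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Group 0: weakly correlated cloud centered at (0, 0)
x0 = rng.normal(0, 1, 1000)
y0 = 0.3 * x0 + rng.normal(0, 1, 1000)

# Group 1: equally weakly correlated cloud shifted to (5, 1.5)
x1 = rng.normal(5, 1, 1000)
y1 = 0.3 * x1 + rng.normal(0, 1, 1000)

r0 = np.corrcoef(x0, y0)[0, 1]
r1 = np.corrcoef(x1, y1)[0, 1]
r_all = np.corrcoef(np.concatenate([x0, x1]),
                    np.concatenate([y0, y1]))[0, 1]

# The pooled correlation exceeds both within-group correlations,
# because the between-group separation adds shared variation.
print(r0, r1, r_all)
```

So, even before looking at the code, smaller per-class correlations than the total are not by themselves evidence of a bug.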

Here I compute the correlation over the whole dataset with pandas' corr() function, which computes the Pearson correlation matrix:

dfBDT = pd.read_csv("BDTResults.csv")
dfNN = pd.read_csv("NNResults.csv")

# rows are not sorted by EventNumber by default
dfBDT = dfBDT.sort_values('EventNumber')
dfNN = dfNN.sort_values('EventNumber')

# Resets index of sorted dataframe so sorted dataframe index begins at 0
dfBDT.reset_index(drop=True, inplace=True)
dfNN.reset_index(drop=True, inplace=True)

dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)

dfTotal = pd.concat([dfnum,dfscore], axis = 1)
dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']

dfTotal.corr()

This gives a correlation of 97%. I then do the same for background events only, which I have defined as Class 0:

BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
BDT_back.reset_index(drop=True, inplace=True)

BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
BDT_back_num.reset_index(drop=True, inplace=True)


NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
NN_back.reset_index(drop=True, inplace=True)

NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber']
NN_back_num.reset_index(drop=True, inplace=True)



dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
                   axis = 1)
dfBack.reset_index(drop=True, inplace=True)

dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']

dfBack.corr()

This gives me a correlation of about 96%. I then repeat the above for signal events, i.e. replacing Class == 0 with Class == 1, and get a correlation of 91%.

Then, if I reconcatenate the two dataframes and compute the total correlation again, I get a higher correlation than before, namely 98%:

ab = pd.concat([dfBack['BDT'],dfSig['BDT']])
ba = pd.concat([dfBack['NN'],dfSig['NN']])

abba =pd.concat([ab,ba], axis = 1)
abba.corr()

The fact that these two values differ must mean something is wrong somewhere, but I can't see where.


1 answer

Ultimately, it comes down to the horizontal concatenation operating on the index.

Mismatched rows

If the two dataframes have different numbers of rows, concat, which defaults to an outer join, will produce NaN at the unmatched indexes (on the side of the smaller dataframe), yielding more rows than either original dataframe had before the split.
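This alignment behavior can be seen in a minimal sketch (hypothetical series, not the OP's data): with axis=1, pd.concat aligns on the index, and any index present in only one side yields NaN in the other side's column.

```python
import pandas as pd

s_bdt = pd.Series([0.9, 0.8, 0.7], index=[0, 1, 2], name='BDT')
s_nn = pd.Series([0.85, 0.75], index=[0, 1], name='NN')  # one row fewer

outer = pd.concat([s_bdt, s_nn], axis=1)                 # default join='outer'
inner = pd.concat([s_bdt, s_nn], axis=1, join='inner')   # drops unmatched rows

# Under the outer join, index 2 exists only in s_bdt, so NN gets NaN there;
# the inner join instead keeps only the two shared indexes.
print(outer)
print(len(outer), len(inner))
```

Since DataFrame.corr() silently drops NaN pairs, these phantom rows change which observations enter each correlation.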

Mismatched classes

Moreover, if the class shares differ between the two dataframes dfBDT and dfNN, their corresponding per-class joins will also return NaN at unmatched indexes.

For example, suppose dfBDT splits 60%/40% between Class 0 and Class 1 while dfNN splits 50%/50%; then in the per-class comparison:

  • BDT Class 0 will have more rows than NN Class 0
  • BDT Class 1 will have fewer rows than NN Class 1

After the horizontal join with pd.concat(..., axis = 1), whose default is an outer join (join = 'outer'), the resulting mismatches generate NaN on both sides. Even if you used join = 'inner', you would merely be filtering out the unmatched rows, whereas dfTotal never filters out any rows and keeps them all.
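A hedged alternative worth considering (assuming, as the question suggests, that both CSVs carry the same EventNumber key and a Class column): pairing rows explicitly on EventNumber with merge avoids index alignment entirely, so the per-class subsets stay consistent with the total. A toy sketch:

```python
import pandas as pd

# Hypothetical miniature versions of the OP's two results frames
dfBDT = pd.DataFrame({'EventNumber': [1, 2, 3, 4],
                      'Class': [0, 0, 1, 1],
                      'score': [0.2, 0.1, 0.9, 0.8]})
dfNN = pd.DataFrame({'EventNumber': [1, 2, 3, 4],
                     'Class': [0, 0, 1, 1],
                     'score': [0.25, 0.15, 0.85, 0.95]})

# Rows are matched by event identity, not by position in the frame
merged = dfBDT.merge(dfNN, on='EventNumber', suffixes=('_BDT', '_NN'))

total_corr = merged['score_BDT'].corr(merged['score_NN'])
back = merged[merged['Class_BDT'] == 0]
back_corr = back['score_BDT'].corr(back['score_NN'])
print(total_corr, back_corr)
```

With key-based pairing, no sorting or reset_index bookkeeping is needed before splitting by class.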

Sort order

Testing with seeded, reproducible examples on both Linux and Windows machines points to a sorting issue: specifically, the frames need to be sorted by Class first and then by EventNumber.
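The pitfall can be shown with toy frames (hypothetical values, not the OP's data): after sorting each frame independently and resetting the index, concat pairs rows purely by position, so any difference in sort order silently pairs scores from different events.

```python
import pandas as pd

dfA = pd.DataFrame({'EventNumber': [2, 1], 'score': [0.8, 0.1]})
dfB = pd.DataFrame({'EventNumber': [1, 2], 'score': [0.15, 0.75]})

# Without a common sort, positional pairing matches event 2's score in dfA
# against event 1's score in dfB
bad = pd.concat([dfA['score'].reset_index(drop=True),
                 dfB['score'].reset_index(drop=True)], axis=1)

# Sorting both frames by the same key first restores correct pairing
dfA_s = dfA.sort_values('EventNumber').reset_index(drop=True)
dfB_s = dfB.sort_values('EventNumber').reset_index(drop=True)
good = pd.concat([dfA_s['score'], dfB_s['score']], axis=1)

print(bad.iloc[0].tolist(), good.iloc[0].tolist())
```

The mispairing changes every row of the correlation input without raising any error, which is why the effect only shows up as a shifted correlation value.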


This can be demonstrated with seeded random data in a reproducible example. The code below refactors the setup to replace the many pd.concat calls with DataFrame.join (note that join's default, how='left', differs from concat's outer-join default). Further down, this code is shown to be equivalent to the OP's original setup.

Data

import numpy as np
import pandas as pd

np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                      'Class': np.random.randint(0, 2, 500),
                      'score': np.random.randn(500)
                     })


dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                     'Class': np.random.randint(0, 2, 500),
                     'score': np.random.randn(500)
                    })

Code

dfBDT = dfBDT.sort_values(['Class', 'EventNumber']).reset_index(drop=True)    
dfNN = dfNN.sort_values(['Class', 'EventNumber']).reset_index(drop=True)  

# ALL ROWS (NO FILTER)
dfTotal = (dfBDT.reindex(['EventNumber', 'score'], axis='columns')
                .join(dfNN.reindex(['EventNumber', 'score'], axis='columns'),
                      rsuffix = '_')
                .set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'], 
                          axis='columns', inplace = False)
                .reindex(['EventNumberBDT','EventNumberNN','BDT','NN'], 
                         axis='columns'))    
dfTotal.corr()

# TWO FILTERED DATA FRAMES CLASS (0 FOR BACKGROUND, 1 FOR SIGNAL)
df_list = [(dfBDT.query('Class == {}'.format(i))
                 .reindex(['EventNumber', 'score'], axis='columns')
                 .join(dfNN.query('Class == {}'.format(i))
                           .reindex(['EventNumber', 'score'], axis='columns'),
                       rsuffix = '_')
                 .set_axis(['EventNumberBDT', 'BDT', 'EventNumberNN', 'NN'],
                           axis='columns', inplace = False)

                 .reindex(['EventNumberBDT','EventNumberNN','BDT','NN'],
                          axis='columns')
           ) for i in range(0,2)]

dfSub = pd.concat(df_list)

dfSub.corr()

Output (note that they return different results)

dfTotal.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.912279 -0.024121  0.115754
# EventNumberNN         0.912279       1.000000 -0.039038  0.122905
# BDT                  -0.024121      -0.039038  1.000000  0.012143
# NN                    0.115754       0.122905  0.012143  1.000000

dfSub.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.974140 -0.024121  0.120102
# EventNumberNN         0.974140       1.000000 -0.026026  0.122905
# BDT                  -0.024121      -0.026026  1.000000  0.025548
# NN                    0.120102       0.122905  0.025548  1.000000

However, if we make the class shares equal (e.g. 50%/50% in both dataframes, or any matching split across the two), the outputs match exactly:

np.random.seed(2292020)
dfBDT = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                      'Class': np.concatenate((np.zeros(250), np.ones(250))),
                      'score': np.random.randn(500)
                     })


dfNN = pd.DataFrame({'EventNumber': np.random.randint(1, 15, 500),
                     'Class': np.concatenate((np.zeros(250), np.ones(250))),
                     'score': np.random.randn(500)
                    })

...

dfTotal.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000


dfSub.corr()
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000

Finally, this has been tested against the OP's original code:

def op_approach_total():
    dfscore = pd.concat([dfBDT['score'],dfNN['score']], axis = 1)
    dfnum = pd.concat([dfBDT['EventNumber'],dfNN['EventNumber']], axis = 1)

    dfTotal = pd.concat([dfnum,dfscore], axis = 1)
    dfTotal.columns = ['EventNumberBDT', 'EventNumberNN', 'BDT', 'NN']

    return dfTotal.corr()


def op_approach_split():
    # background events (Class == 0)
    BDT_back = (dfBDT.loc[dfBDT['Class'] == 0])['score']
    BDT_back.reset_index(drop=True, inplace=True)

    BDT_back_num = (dfBDT.loc[dfBDT['Class'] == 0])['EventNumber']
    BDT_back_num.reset_index(drop=True, inplace=True)


    NN_back = (dfNN.loc[dfNN['Class'] == 0])['score']
    NN_back.reset_index(drop=True, inplace=True)

    NN_back_num = (dfNN.loc[dfNN['Class'] == 0])['EventNumber'] 
    NN_back_num.reset_index(drop=True, inplace=True)


    dfBack = pd.concat([BDT_back_num,NN_back_num,BDT_back,NN_back],
                       axis = 1)
    dfBack.reset_index(drop=True, inplace=True)
    dfBack.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']


    # signal events (Class == 1)
    BDT_sig = (dfBDT.loc[dfBDT['Class'] == 1])['score']
    BDT_sig.reset_index(drop=True, inplace=True)

    BDT_sig_num = (dfBDT.loc[dfBDT['Class'] == 1])['EventNumber']
    BDT_sig_num.reset_index(drop=True, inplace=True)

    NN_sig = (dfNN.loc[dfNN['Class'] == 1])['score']
    NN_sig.reset_index(drop=True, inplace=True)

    NN_sig_num = (dfNN.loc[dfNN['Class'] == 1])['EventNumber']
    NN_sig_num.reset_index(drop=True, inplace=True)


    dfSig = pd.concat([BDT_sig_num, NN_sig_num, BDT_sig, NN_sig],
                       axis = 1)
    dfSig.reset_index(drop=True, inplace=True)
    dfSig.columns = ['EventNumberBDT','EventNumberNN','BDT','NN']

    # ADDING EventNumber COLUMNS
    ev_back = pd.concat([dfBack['EventNumberBDT'], dfSig['EventNumberBDT']])
    ev_sig = pd.concat([dfBack['EventNumberNN'], dfSig['EventNumberNN']])


    ab = pd.concat([dfBack['BDT'], dfSig['BDT']])

    ba = pd.concat([dfBack['NN'], dfSig['NN']])

    # HORIZONTAL MERGE
    abba = pd.concat([ev_back, ev_sig, ab, ba], axis = 1)

    return abba.corr()

opTotal = op_approach_total()
opSub = op_approach_split()

Output

opTotal = op_approach_total()
opTotal
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000

opSub = op_approach_split()
opSub
#                 EventNumberBDT  EventNumberNN       BDT        NN
# EventNumberBDT        1.000000       0.992846 -0.026130  0.023623
# EventNumberNN         0.992846       1.000000 -0.023411  0.022093
# BDT                  -0.026130      -0.023411  1.000000 -0.026454
# NN                    0.023623       0.022093 -0.026454  1.000000
