Code to create all possible column pairs in Pandas

Posted 2024-10-02 12:36:48


Given the following df:

import pandas as pd

data = [['TAMU', 54, 0, 0, 6, 5, 0], ['UIUC', 33, 43, 5, 0, 76, 81],
        ['USC', 4, 1, 0, 7, 21, 4], ['Austin', 22, 31, 0, 0, 55, 0],
        ['UCLA', 55, 6, 7, 9, 11, 12]]
df = pd.DataFrame(data, columns=['Name', 'Research', 'Thesis',
                                 'Proposal', 'AI', 'Analytics', 'Data'])

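For two specified rows (say USC and UCLA), I want to create contingency tables for all possible column combinations (for example: AI, Analytics; Data, AI) to feed to my chi-square function: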

def overflow(school1, school2, alpha):
    pvals_list = []
    data = [['TAMU', 54, 0, 0, 6, 5, 0], ['UIUC', 33, 43, 5, 0, 76, 81],
            ['USC', 4, 1, 0, 7, 21, 4], ['Austin', 22, 31, 0, 0, 55, 0],
            ['UCLA', 55, 6, 7, 9, 11, 12]]
    df = pd.DataFrame(data, columns=['Name', 'Research', 'Thesis', 'Proposal',
                                     'AI', 'Analytics', 'Data'])
    # keep only the two requested schools
    df = df[(df['Name'] == school1) | (df['Name'] == school2)]
    # drop every column that contains a zero
    df = df.loc[:, df.ne(0).all()]
    df = df.set_index('Name')
    ###
    #### code to create column pairs [for loop?] to feed to data_crosstab below
    ###
        data_crosstab = pd.crosstab()
        chi, p_vals = stats.chi2_contingency(data_crosstab)[:2]
        if p_vals > alpha:
            pvals_list.append(p_vals)
    return pvals_list
overflow('USC', 'UCLA', 0.05)

Edit: I have tried several different approaches so far, but none of them worked. Any help would be greatly appreciated.


3 Answers

Is this what you want?

from itertools import combinations
list(combinations(['Name', 'Research', 'Thesis',
                   'Proposal', 'AI', 'Analytics', 'Data'], 2))

Output:

[('Name', 'Research'),
 ('Name', 'Thesis'),
 ('Name', 'Proposal'),
 ('Name', 'AI'),
 ('Name', 'Analytics'),
 ('Name', 'Data'),
 ('Research', 'Thesis'),
 ('Research', 'Proposal'),
 ('Research', 'AI'),
 ('Research', 'Analytics'),
 ('Research', 'Data'),
 ('Thesis', 'Proposal'),
 ('Thesis', 'AI'),
 ('Thesis', 'Analytics'),
 ('Thesis', 'Data'),
 ('Proposal', 'AI'),
 ('Proposal', 'Analytics'),
 ('Proposal', 'Data'),
 ('AI', 'Analytics'),
 ('AI', 'Data'),
 ('Analytics', 'Data')]
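
Note that the list above also pairs 'Name' with every other column, and 'Name' is only the row label. A minimal sketch, assuming the label column should be left out before pairing (the variable names `value_columns` and `pairs` are illustrative, not from the original post):

```python
from itertools import combinations
from math import comb

columns = ['Name', 'Research', 'Thesis', 'Proposal', 'AI', 'Analytics', 'Data']

# Assumption: 'Name' is a label, not a measurement, so it is excluded.
value_columns = [c for c in columns if c != 'Name']

# combinations() yields each unordered pair once, in input order,
# so ('AI', 'Data') appears but ('Data', 'AI') does not.
pairs = list(combinations(value_columns, 2))

# n columns give n * (n - 1) / 2 pairs.
print(len(pairs), comb(len(value_columns), 2))
```

If both orderings of a pair are needed, itertools.permutations(value_columns, 2) could be used instead.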

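You need to pass the two rows to pd.crosstab to create an R×C table (this assumes df is indexed by name, i.e. after df = df.set_index('Name')):

>>> data_crosstab = pd.crosstab(df.loc['USC'], df.loc['UCLA'])
>>> data_crosstab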
UCLA  6   7   9   11  12  55
USC                         
0      0   1   0   0   0   0
1      1   0   0   0   0   0
4      0   0   0   0   1   1
7      0   0   1   0   0   0
21     0   0   0   1   0   0

You can then pass it to scipy.stats.chi2_contingency to get the result:

>>> stats.chi2_contingency(pd.crosstab(df.loc['USC'], df.loc['UCLA']))
(24.000000000000014,
 0.24239216167051175,
 20,
 array([[0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667],
       [0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
        0.33333333],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667],
       [0.16666667, 0.16666667, 0.16666667, 0.16666667, 0.16666667,
        0.16666667]]))

# chi is the first value, i.e. 24, and p_vals is the second value, i.e. 0.24239

The above works for any pair of row labels; just replace USC and UCLA.

If you want to do this for all rows, you can loop over the index values using combinations from itertools:

from itertools import combinations
for left, right in combinations(df.index.tolist(), 2):
    data_crosstab = pd.crosstab(df.loc[left], df.loc[right])

    #rest of the code
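
The loop above can be filled in roughly as follows. This is only a sketch: it follows this answer's approach of cross-tabulating the two rows' values, and the dictionary name `row_pvals` is an illustrative assumption. The result of chi2_contingency is indexed by position, matching the tuple shown above.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

data = [['TAMU', 54, 0, 0, 6, 5, 0], ['UIUC', 33, 43, 5, 0, 76, 81],
        ['USC', 4, 1, 0, 7, 21, 4], ['Austin', 22, 31, 0, 0, 55, 0],
        ['UCLA', 55, 6, 7, 9, 11, 12]]
df = pd.DataFrame(data, columns=['Name', 'Research', 'Thesis',
                                 'Proposal', 'AI', 'Analytics', 'Data'])
df = df.set_index('Name')  # so that df.loc['USC'] selects a row

row_pvals = {}  # hypothetical container: (left, right) -> p-value
for left, right in combinations(df.index.tolist(), 2):
    # Cross-tabulate the values of the two rows, as in the answer above.
    data_crosstab = pd.crosstab(df.loc[left], df.loc[right])
    result = stats.chi2_contingency(data_crosstab.to_numpy())
    row_pvals[(left, right)] = result[1]  # second element is the p-value

print(row_pvals[('USC', 'UCLA')])
```

With five rows this produces ten pairs; the ('USC', 'UCLA') entry should reproduce the p-value shown earlier in this answer.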

IIUC, you want itertools.combinations:

from itertools import combinations
for col1, col2 in combinations(df.set_index("Name").columns,2):
    pass  # add your code here

The result of using combinations is:

>>> list(combinations(df.set_index("Name").columns,2))
[('Research', 'Thesis'),
 ('Research', 'Proposal'),
 ('Research', 'AI'),
 ('Research', 'Analytics'),
 ('Research', 'Data'),
 ('Thesis', 'Proposal'),
 ('Thesis', 'AI'),
 ('Thesis', 'Analytics'),
 ('Thesis', 'Data'),
 ('Proposal', 'AI'),
 ('Proposal', 'Analytics'),
 ('Proposal', 'Data'),
 ('AI', 'Analytics'),
 ('AI', 'Data'),
 ('Analytics', 'Data')]
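
Putting the column pairs back into the question's overflow function might look like the sketch below. Two assumptions are made that the original post does not confirm: the values are treated as counts, so the 2×2 sub-frame for the two schools and a column pair is used directly as the contingency table instead of calling pd.crosstab; and the result is a dict keyed by column pair rather than a plain list. As in the question, only p-values greater than alpha are kept.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

DATA = [['TAMU', 54, 0, 0, 6, 5, 0], ['UIUC', 33, 43, 5, 0, 76, 81],
        ['USC', 4, 1, 0, 7, 21, 4], ['Austin', 22, 31, 0, 0, 55, 0],
        ['UCLA', 55, 6, 7, 9, 11, 12]]
COLUMNS = ['Name', 'Research', 'Thesis', 'Proposal', 'AI', 'Analytics', 'Data']


def overflow(school1, school2, alpha):
    df = pd.DataFrame(DATA, columns=COLUMNS).set_index('Name')
    df = df.loc[[school1, school2]]   # keep only the two schools
    df = df.loc[:, df.ne(0).all()]    # drop columns containing a zero
    pvals = {}
    for col1, col2 in combinations(df.columns, 2):
        # Assumption: the 2x2 block of counts is the contingency table.
        table = df[[col1, col2]].to_numpy()
        p_val = stats.chi2_contingency(table)[1]
        if p_val > alpha:             # same filter as in the question
            pvals[(col1, col2)] = p_val
    return pvals


pvals = overflow('USC', 'UCLA', 0.05)
print(pvals)
```

For USC and UCLA the Proposal column is dropped because USC has a zero there, which leaves five columns and therefore ten candidate pairs.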
