Pandas: complex mapping between DataFrames of different sizes

Posted 2024-09-28 03:18:58

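I need to map between two completely different DataFrames (thanks, biology). All the pandas tutorials I have found cover much simpler transformations, and I cannot solve this without four nested loops (I am a real beginner). I am really curious about a Pythonic way to solve this without going back to Excel.

The first one looks like df1: observations of 0s and 1s for thousands of genes across the categories a-j.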

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])

print(df1)

        a  b  c  d  e  f  g  h  i  j
gene1   1  0  1  0  1  0  1  1  1  0
gene2   0  1  0  0  0  0  0  0  1  0
gene3   0  1  1  1  1  1  0  0  0  0
gene4   1  0  1  0  0  1  0  1  1  1
gene5   0  0  1  0  0  0  0  0  0  0
gene6   0  1  0  0  1  0  1  0  1  0
gene7   1  1  0  1  1  0  0  0  1  0
gene8   0  0  0  1  1  1  1  0  1  0
gene9   1  0  1  0  1  0  1  1  0  1
gene10  1  0  0  0  1  0  1  0  1  1

The second one looks like df2: a mapping from high-level categories (X, Y, Z, W) to low-level categories. This one has NaNs and no index.

df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
                       'Y': ['d', 'b', 'c','f'],
                       'Z':['g', 'h','e','NaN'],
                       'W': ['i', 'j','NaN','NaN']},index=None)

print(df2)

     W    X  Y    Z
0    i    a  d    g
1    j  NaN  b    h
2  NaN  NaN  c    e
3  NaN  NaN  f  NaN

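What I need is result1. There is one tricky detail: gene4, for example, is in both categories i and j, which both belong to category W, but I still want a single '1' in result1.loc['gene4','W']. The final result must remain binary.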

result1 = pd.DataFrame({'X': ['1','0','0','1','0','0','1','0','1','1'],
                   'Y': ['1','1','1','1','1','1','1','1','1','0'],
                   'Z': ['1','0','1','1','0','1','1','1','1','1'],
                   'W': ['1','1','0','1','0','1','1','1','1','1']}, index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])
print(result1)


        W  X  Y  Z
gene1   1  1  1  1
gene2   1  0  1  0
gene3   0  0  1  1
gene4   1  1  1  1
gene5   0  0  1  0
gene6   1  0  1  1
gene7   1  1  1  1
gene8   1  0  1  1
gene9   1  1  1  1
gene10  1  1  0  1

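This could be another possible result format. [Updated with the actual expected result.] If anyone wants to show both (or a simple way to convert between them), that would be even more appreciated, and science will be grateful too.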

result1 = pd.DataFrame({'1': ['gene1','gene1','gene1','gene1'],
                       '2': ['gene2','gene4','gene2','gene3'],
                       '3': ['gene4','gene7','gene3','gene4'],
                       '4': ['gene6','gene9','gene4','gene6'],
                       '5': ['gene7','gene10','gene5','gene7'],
                       '6': ['gene8','NaN','gene6','gene8'],
                       '7': ['gene9','NaN','gene7','gene9'],
                       '8': ['gene10','NaN','gene8','gene10'],
                       '9': ['NaN','NaN','gene9','NaN'],
                       },
                       index = ['W','X','Y','Z'])
print(result1)

       1      2      3      4       5      6      7       8      9
W  gene1  gene2  gene4  gene6   gene7  gene8  gene9  gene10    NaN
X  gene1  gene4  gene7  gene9  gene10    NaN    NaN     NaN    NaN
Y  gene1  gene2  gene3  gene4   gene5  gene6  gene7   gene8  gene9
Z  gene1  gene3  gene4  gene6   gene7  gene8  gene9  gene10    NaN

Thank you very much for your patience in reading this long question.


1 answer

Answer #1 · posted 2024-09-28 03:18:58

Here we go! Let's try this.

df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])

df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
                       'Y': ['d', 'b', 'c','f'],
                       'Z':['g', 'h','e','NaN'],
                       'W': ['i', 'j','NaN','NaN']},index=None)

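# Turn the 'NaN' placeholder strings into real missing values
df2 = df2.replace('NaN',np.nan)

# Lookup Series: low-level category (a-j) -> high-level category (X, Y, Z, W)
gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']

# Keep only the (gene, low-level) pairs equal to 1, map them to their high-level category, drop repeats
df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()

# Count per (gene, high-level category) and pivot the categories into columns
df_out = df3.groupby(['index','level_1'])['level_1'].count().unstack()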

print(df_out)

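Output (df1 is generated randomly here, so the values differ from the example in the question; combinations that do not occur appear as NaN rather than 0):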

level_1    W    X    Y    Z
index                      
gene1    1.0  NaN  NaN  NaN
gene10   1.0  1.0  1.0  1.0
gene2    1.0  1.0  1.0  1.0
gene3    1.0  1.0  1.0  1.0
gene4    1.0  NaN  1.0  1.0
gene5    1.0  NaN  1.0  NaN
gene6    1.0  1.0  1.0  1.0
gene7    NaN  1.0  1.0  1.0
gene8    NaN  NaN  1.0  1.0
gene9    1.0  NaN  NaN  1.0
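
If the result has to be strictly binary and reproducible, one possible variant is sketched below. It rebuilds df1 from the table printed in the question instead of drawing random values, flattens df2 into a low-level to high-level lookup, and takes the maximum per high-level category, so a gene that hits both i and j still gets a single 1 in W. The names `pairs`, `lookup` and `result` are illustrative and are not part of the original answer.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# Values copied from the df1 printed in the question (not random).
df1 = pd.DataFrame(
    [[1, 0, 1, 0, 1, 0, 1, 1, 1, 0],
     [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
     [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 1, 0, 1, 1, 1],
     [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
     [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
     [0, 0, 0, 1, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
     [1, 0, 0, 0, 1, 0, 1, 0, 1, 1]],
    index=genes, columns=list("abcdefghij"))

df2 = pd.DataFrame({"X": ["a", "NaN", "NaN", "NaN"],
                    "Y": ["d", "b", "c", "f"],
                    "Z": ["g", "h", "e", "NaN"],
                    "W": ["i", "j", "NaN", "NaN"]})

# Long format: one row per (high-level, low-level) pair.
pairs = df2.melt(var_name="high", value_name="low")
# Drop the 'NaN' placeholder strings used as padding in df2.
pairs = pairs[pairs["low"] != "NaN"]
# Lookup Series indexed by low-level category, e.g. lookup["i"] == "W".
lookup = pairs.set_index("low")["high"]

# Group the low-level columns by their high-level category and take the
# maximum: 1 if the gene is in any member category, otherwise 0.
result = df1.T.groupby(lookup).max().T
print(result)
```

Taking the maximum rather than the sum is what keeps the table binary when a gene falls into several low-level categories of the same high-level one.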

Edit: to get the alternative output format

df1 = pd.DataFrame(np.random.randint(0,2,size =(10,10)),columns=list('abcdefghij'), index = ['gene1','gene2','gene3','gene4','gene5','gene6','gene7','gene8','gene9','gene10'])

df2 = pd.DataFrame({'X': ['a','NaN','NaN','NaN'],
                       'Y': ['d', 'b', 'c','f'],
                       'Z':['g', 'h','e','NaN'],
                       'W': ['i', 'j','NaN','NaN']},index=None)

df2 = df2.replace('NaN',np.nan)

gmap = df2.stack().reset_index().drop('level_0',axis=1).set_index(0)['level_1']

df3 = df1.stack().replace(0,np.nan).dropna().reset_index(level=1)['level_1'].map(gmap).reset_index().drop_duplicates()

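# Column position taken from the gene number, e.g. 'gene10' -> 10
df3['cols'] = df3['index'].str.split('gene').str[1].astype(int)

# One row per high-level category, one column per gene number
df_out2 = df3.set_index(['level_1','cols'])['index'].unstack()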

Output (again from a randomly generated df1):

cols        1      2      3      4      5      6      7      8      9       10
level_1                                                                       
W        gene1  gene2  gene3  gene4  gene5   None  gene7  gene8  gene9  gene10
X         None   None  gene3   None  gene5   None   None  gene8  gene9  gene10
Y        gene1  gene2  gene3  gene4  gene5  gene6  gene7  gene8  gene9  gene10
Z         None  gene2   None  gene4   None  gene6   None  gene8  gene9    None
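
The question also asks for the gene lists packed to the left (no gaps between genes) and for a way to convert between the two formats. A minimal sketch of both directions follows, assuming the binary table is already available; here it is copied from the expected result1 in the question, with integers instead of strings. The names `binary`, `gene_lists`, `wide`, `long_form` and `back` are illustrative only.

```python
import pandas as pd

genes = [f"gene{i}" for i in range(1, 11)]
# Binary table copied from the expected result1 in the question.
binary = pd.DataFrame({"W": [1, 1, 0, 1, 0, 1, 1, 1, 1, 1],
                       "X": [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
                       "Y": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
                       "Z": [1, 0, 1, 1, 0, 1, 1, 1, 1, 1]}, index=genes)

# Binary -> gene lists: for each category keep the genes whose value is 1.
gene_lists = {cat: binary.index[binary[cat] == 1].tolist()
              for cat in binary.columns}
# Lists of unequal length become rows padded with missing values on the right.
wide = pd.DataFrame.from_dict(gene_lists, orient="index")
wide.columns = range(1, wide.shape[1] + 1)
print(wide)

# Gene lists -> binary: go to long form, drop the padding, cross-tabulate.
long_form = (wide.rename_axis("category").reset_index()
                 .melt(id_vars="category", value_name="gene")
                 .dropna(subset=["gene"]))
back = (pd.crosstab(long_form["gene"], long_form["category"])
          .reindex(genes, fill_value=0))  # restore gene order, keep genes with no category
print(back)
```

The `reindex` step is there because `pd.crosstab` sorts the gene labels as strings (gene10 before gene2) and would omit any gene that belongs to no category.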
