使用groupby/aggregate返回多个列

2024-10-01 22:39:10 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个示例数据集,我想按一列分组,然后根据现有列的所有值生成4个新列。在

以下是一些示例数据:

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
  1: u'ENSMUST00000000001.4-1',
  2: u'ENSMUST00000000003.13-0',
  3: u'ENSMUST00000000003.13-0',
  4: u'ENSMUST00000000003.13-0'},
 'name': {0: u'NonCodingDeletion',
  1: u'NonCodingInsertion',
  2: u'CodingDeletion',
  3: u'CodingInsertion',
  4: u'NonCodingDeletion'},
 'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
 'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)

看起来像这样:

^{pr2}$

我想根据name列中值的存在与否返回布尔值,具体取决于value_CDS是否只包含空值。为此,我创建了此函数:

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

然后这样做了:

merged = df.groupby('AlignmentId').aggregate(aggfunc)

这给了我一个错误ValueError: Shape of passed values is (318, 4), indices imply (318, 3)。在

如何从groupby聚合返回多个新列?在

我想要的输出是:

ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)

我会把它放在一个5列的数据帧中。在

如果我使用.apply,则输出不正确:

ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0    (False, False, False, False)

但如果我一次只抓到一组,这是正确的:

In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print aggfunc(d)
   .....:
(False, False, False, False)
(True, True, True, False)

Tags: 数据nameinfalsetrue示例dfvalue
1条回答
网友
1楼 · 发布于 2024-10-01 22:39:10

您需要将name更改为['name'],因为.name返回组的名称(列分组依据的值):

def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])

    return ('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1     (False, False, False, False)
ENSMUST00000000003.13-0       (True, True, True, False)
dtype: object

^{pr2}$

改进代码:

def aggfunc(s):
    #if and else return same c, so omitted
    c = set(s['name'])

    #added Series for return columns instead tuples
    cols = ['col1','col2','col3','col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c, 
            'CodingInsertion' in c, 'CodingDeletion' in c, 
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c), index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)

                          col1   col2   col3   col4
AlignmentId                                        
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False

相关问题 更多 >

    热门问题