Create new DataFrames from the rows of each group

Posted 2024-06-28 11:07:43

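I want to pull the rows out of each group of a DataFrame and build new DataFrames from them, so that one new DataFrame contains only the first row of every group, another contains the second rows, another the third rows, and so on. For example, my DataFrame is: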

import pandas as pd

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'name', 'preTestScore', 'postTestScore'])
df

      regiment      name  preTestScore  postTestScore
0   Nighthawks    Miller             4             25
1   Nighthawks  Jacobson            24             94
2   Nighthawks       Ali            31             57
3   Nighthawks    Milner             2             62
4     Dragoons     Cooze             3             70
5     Dragoons     Jacon             4             25
6     Dragoons    Ryaner            24             94
7     Dragoons      Sone            31             57
8       Scouts     Sloan             2             62
9       Scouts     Piger             3             70
10      Scouts     Riani             2             62
11      Scouts       Ali             3             70

I grouped it by regiment:

gb = df.groupby("regiment")

   regiment   name  preTestScore  postTestScore
8    Scouts  Sloan             2             62
9    Scouts  Piger             3             70
10   Scouts  Riani             2             62
11   Scouts    Ali             3             70
------------------
     regiment      name  preTestScore  postTestScore
0  Nighthawks    Miller             4             25
1  Nighthawks  Jacobson            24             94
2  Nighthawks       Ali            31             57
3  Nighthawks    Milner             2             62
------------------
   regiment    name  preTestScore  postTestScore
4  Dragoons   Cooze             3             70
5  Dragoons   Jacon             4             25
6  Dragoons  Ryaner            24             94
7  Dragoons    Sone            31             57
------------------

I want to create DataFrames such as:

A DataFrame with the first rows:

    regiment        name         preTestScore  postTestScore
8    Scouts        Sloan              2             62
0    Nighthawks    Miller             4             25
4    Dragoons      Cooze              3             70

A DataFrame with the second rows:

   regiment          name        preTestScore  postTestScore
9    Scouts         Piger             3             70
1    Nighthawks    Jacobson           24            94
5    Dragoons       Jacon             4             25

And so on.

I was thinking of using group.apply(), but I'm not quite sure.

Thanks a lot!


3 Answers

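You can probably do this with a nested groupby on cumcount. For example, this groups together the first occurrence of every regiment, then the second occurrence of every regiment, and so on: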

In []:
[g for _, g in df.groupby(df.groupby('regiment').cumcount())]

Out[]:
[     regiment    name  preTestScore  postTestScore
 0  Nighthawks  Miller             4             25
 4    Dragoons   Cooze             3             70
 8      Scouts   Sloan             2             62,
      regiment      name  preTestScore  postTestScore
 1  Nighthawks  Jacobson            24             94
 5    Dragoons     Jacon             4             25
 9      Scouts     Piger             3             70,
       regiment    name  preTestScore  postTestScore
 2   Nighthawks     Ali            31             57
 6     Dragoons  Ryaner            24             94
 10      Scouts   Riani             2             62,
       regiment    name  preTestScore  postTestScore
 3   Nighthawks  Milner             2             62
 7     Dragoons    Sone            31             57
 11      Scouts     Ali             3             70]
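
If you would rather look the frames up by position than by list index, the same idea can be sketched with a dict. The names `pos` and `frames` are my own, not from the question, and the DataFrame is rebuilt here only so that the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

# Position of each row inside its own regiment: 0 for the first row,
# 1 for the second row, and so on.
pos = df.groupby('regiment').cumcount()

# One DataFrame per position, keyed by that position.
frames = {k: g for k, g in df.groupby(pos)}

first_rows = frames[0]   # first row of every regiment
second_rows = frames[1]  # second row of every regiment
```

Because cumcount counts rows per group, this should also cope with regiments of different sizes; a position that only some regiments reach simply yields a shorter frame.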

Group by a custom index, and store the results in a dict:

In [67]: {x:g for x,g in df.sort_values(by='regiment',ascending=False).groupby(df.index%4)}
Out[67]:
{0:      regiment    name  preTestScore  postTestScore
 8      Scouts   Sloan             2             62
 0  Nighthawks  Miller             4             25
 4    Dragoons   Cooze             3             70,
 1:      regiment      name  preTestScore  postTestScore
 9      Scouts     Piger             3             70
 1  Nighthawks  Jacobson            24             94
 5    Dragoons     Jacon             4             25,
 2:       regiment    name  preTestScore  postTestScore
 10      Scouts   Riani             2             62
 2   Nighthawks     Ali            31             57
 6     Dragoons  Ryaner            24             94,
 3:       regiment    name  preTestScore  postTestScore
 11      Scouts     Ali             3             70
 3   Nighthawks  Milner             2             62
 7     Dragoons    Sone            31             57}

Or in a list:

In [71]: grps = [g for _,g in (df.sort_values(by='regiment',ascending=False)
                                 .groupby(df.index%4))]

In [72]: grps[0]
Out[72]:
     regiment    name  preTestScore  postTestScore
8      Scouts   Sloan             2             62
0  Nighthawks  Miller             4             25
4    Dragoons   Cooze             3             70

In [73]: grps[1]
Out[73]:
     regiment      name  preTestScore  postTestScore
9      Scouts     Piger             3             70
1  Nighthawks  Jacobson            24             94
5    Dragoons     Jacon             4             25
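
Note that `df.index % 4` works here because the sample has the default 0 to 11 index and exactly four consecutive rows per regiment. A minimal sketch that keeps the same output order (Scouts, Nighthawks, Dragoons) without hard-coding the group size, assuming the `df` from the question; `pos` and `ordered` are names I made up:

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

# Within-regiment position, computed on the original row order.
pos = df.groupby('regiment').cumcount()

# Sort regiments in descending order, as in the answer above.
ordered = df.sort_values(by='regiment', ascending=False)

# pos is a Series, so groupby aligns it to the rows by index label,
# not by position; the sort therefore does not disturb the grouping.
grps = [g for _, g in ordered.groupby(pos)]
```

`grps[0]` then holds the first row of each regiment, `grps[1]` the second rows, and so on.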

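Dictionaries are of course unordered (in Python versions before 3.7), which is why the keys in the output below are not in order. Given that your sample data has only four rows per regiment, here is a ranking of the first four rows that uses nth on the groupby. The result is built with a dictionary comprehension that iterates over range(4), i.e. 0, 1, 2, 3, takes the nth row for each value, and converts the value back to its ordinal name (e.g. 0 becomes 'first').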

d = {n: ordinal for n, ordinal in zip(
             range(5), ['first', 'second', 'third', 'fourth', 'fifth'])}

top_n = 4
>>> {d[n]: df.groupby(['regiment']).nth(n) for n in range(top_n)}
{'first':               name  postTestScore  preTestScore
 regiment                                       
 Dragoons     Cooze             70             3
 Nighthawks  Miller             25             4
 Scouts       Sloan             62             2,
 'fourth':               name  postTestScore  preTestScore
 regiment                                       
 Dragoons      Sone             57            31
 Nighthawks  Milner             62             2
 Scouts         Ali             70             3,
 'second':                 name  postTestScore  preTestScore
 regiment                                         
 Dragoons       Jacon             25             4
 Nighthawks  Jacobson             94            24
 Scouts         Piger             70             3,
 'third':               name  postTestScore  preTestScore
 regiment                                       
 Dragoons    Ryaner             94            24
 Nighthawks     Ali             57            31
 Scouts       Riani             62             2}
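
One caveat: as far as I know, from pandas 2.0 onward `nth` behaves as a filter and keeps the original row index instead of putting `regiment` in the index, so the output above may look different on a recent install. A rough sketch that should give regiment-indexed frames on either version, using cumcount instead of nth (the names `ordinals`, `pos` and `by_rank` are my own):

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

ordinals = ['first', 'second', 'third', 'fourth']

# Within-regiment position of every row (0-based).
pos = df.groupby('regiment').cumcount()

# For each position, keep the matching rows and index them by regiment.
by_rank = {
    name: df[pos == n].set_index('regiment').sort_index()
    for n, name in enumerate(ordinals)
}
```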

For groups of differing lengths:

df = df.iloc[1:-1, :]  # Drop first and last row.
>>> {d[n]: df.groupby(['regiment']).nth(n).reindex(sorted(df.regiment.unique())) 
     for n in range(top_n)}
{'first':                 name  postTestScore  preTestScore
 regiment                                         
 Dragoons       Cooze             70             3
 Nighthawks  Jacobson             94            24
 Scouts         Sloan             62             2,
 'fourth':             name  postTestScore  preTestScore
 regiment                                     
 Dragoons    Sone             57            31
 Nighthawks   NaN            NaN           NaN
 Scouts       NaN            NaN           NaN,
 'second':              name  postTestScore  preTestScore
 regiment                                      
 Dragoons    Jacon             25             4
 Nighthawks    Ali             57            31
 Scouts      Piger             70             3,
 'third':               name  postTestScore  preTestScore
 regiment                                       
 Dragoons    Ryaner             94            24
 Nighthawks  Milner             62             2
 Scouts       Riani             62             2}
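
The uneven case can also be sketched without hard-coding `top_n`, by taking it from the largest group. This is only an illustration under the same assumptions as the question's data; `trimmed`, `regiments`, `pos` and `frames` are hypothetical names, and cumcount is used in place of nth so that reindexing by regiment name does not depend on what nth returns as its index:

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon',
             'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
    'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70],
})

# Drop the first and last row so the regiments have 3, 4 and 3 rows.
trimmed = df.iloc[1:-1, :]

regiments = sorted(trimmed['regiment'].unique())
pos = trimmed.groupby('regiment').cumcount()

# Size of the largest regiment decides how many frames are needed.
top_n = int(trimmed.groupby('regiment').size().max())

# reindex adds an all-NaN row for every regiment that has no row
# at position n, so each frame lists every regiment.
frames = {
    n: trimmed[pos == n].set_index('regiment').reindex(regiments)
    for n in range(top_n)
}
```

Here `frames[3]` has a real row only for Dragoons; the Nighthawks and Scouts rows are filled with NaN.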
