Adding a column based on the values of another column in a DataFrame

Posted 2024-10-16 17:23:19


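I'm very new to Python. I ran into this task and have been working on it for a while without any clue. Any suggestions would help. Thanks a lot!

I have a DataFrame like this: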

import pandas as pd
data = {'A': ['Emo/3', 'Emo/4', 'Emo/1','Emo/3', '','Emo/3', 'Emo/4', 'Emo/1','Emo/3', '', 'Neu/5', 'Neu/2','Neu/5', 'Neu/2'],
        'Pos': ["repeat3", "repeat3", "repeat3", "repeat3", '',"repeat1", "repeat1", "repeat1", "repeat1", '', "repeat2", "repeat2","repeat2", "repeat2"],
        }
df = pd.DataFrame(data)
df

    A       Pos
0   Emo/3   repeat3
1   Emo/4   repeat3
2   Emo/1   repeat3
3   Emo/3   repeat3
4       
5   Emo/3   repeat1
6   Emo/4   repeat1
7   Emo/1   repeat1
8   Emo/3   repeat1
9       
10  Neu/5   repeat2
11  Neu/2   repeat2
12  Neu/5   repeat2
13  Neu/2   repeat2

I want output like this:

    A       Pos     B
0   Emo/3   repeat3 0
1   Emo/4   repeat3 0
2   Emo/1   repeat3 0
3   Emo/3   repeat3 0
4           
5   Emo/3   repeat1 1
6   Emo/4   repeat1 2
7   Emo/1   repeat1 3
8   Emo/3   repeat1 4
9           
10  Neu/5   repeat2 4
11  Neu/2   repeat2 3
12  Neu/5   repeat2 2
13  Neu/2   repeat2 1

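The first four positions of column B are always 0. After that, the values in column B depend on the values in column Pos. If the rows in Pos equal "repeat1", column B for those four positions is 1, 2, 3, 4. If the rows in Pos equal "repeat2", column B for those four positions is 4, 3, 2, 1.

The values in Pos always come in runs of four identical rows, followed by a blank fifth row.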

Thanks a lot.
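For reference, the rule described above can be written as a plain loop before reaching for anything vectorized. This is only a minimal sketch: the `PATTERNS` table and the `offset` variable are names made up for illustration, and mapping "repeat3" to zeros is an assumption based on the sample data, where the first four rows are labelled "repeat3".

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["Emo/3", "Emo/4", "Emo/1", "Emo/3", "",
          "Emo/3", "Emo/4", "Emo/1", "Emo/3", "",
          "Neu/5", "Neu/2", "Neu/5", "Neu/2"],
    "Pos": ["repeat3"] * 4 + [""] + ["repeat1"] * 4 + [""] + ["repeat2"] * 4,
})

# Hypothetical lookup table: the sequence each label should produce.
# "repeat3" -> zeros is an assumption taken from the sample data.
PATTERNS = {
    "repeat3": [0, 0, 0, 0],
    "repeat1": [1, 2, 3, 4],
    "repeat2": [4, 3, 2, 1],
}

b_values = []
offset = 0  # position inside the current block of four rows
for pos in df["Pos"]:
    if pos == "":
        b_values.append("")  # blank separator row stays blank
        offset = 0           # the next row starts a new block
    else:
        b_values.append(PATTERNS[pos][offset])
        offset += 1

df["B"] = b_values
print(df)
```

A loop like this is slow on large frames, but it is a handy reference to compare the vectorized answers against.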


3 Answers

Here is a fully vectorized approach using built-in counters and masks (the steps are explained in detail in the next section):

# create counter per section (0123401234...)
divider = df['Pos'].eq('')
section = divider.cumsum()
counter = df['Pos'].groupby(section).cumcount()

# isolate repeat1 and repeat2 sections (and flip the repeat2 counter 1234 -> 4321 via |counter - 5|)
rep1 = counter.where(df['Pos'].eq('repeat1'), 0)
rep2 = counter.sub(5).abs().where(df['Pos'].eq('repeat2'), 0)

# combine rep1 and rep2 (and replace divider rows with empty string)
df['B'] = rep1.add(rep2).mask(divider, '')

Output:

#         A      Pos  B
# 0   Emo/3  repeat3  0
# 1   Emo/4  repeat3  0
# 2   Emo/1  repeat3  0
# 3   Emo/3  repeat3  0
# 4                    
# 5   Emo/3  repeat1  1
# 6   Emo/4  repeat1  2
# 7   Emo/1  repeat1  3
# 8   Emo/3  repeat1  4
# 9                    
# 10  Neu/5  repeat2  4
# 11  Neu/2  repeat2  3
# 12  Neu/5  repeat2  2
# 13  Neu/2  repeat2  1

Steps

  1. Use cumsum to create pseudo-groups from the blank-row dividers:

    divider = df['Pos'].eq('')
    section = divider.cumsum()
    
    # 0     0
    # 1     0
    # 2     0
    # 3     0
    # 4     1
    # 5     1
    # 6     1
    # 7     1
    # 8     1
    # 9     2
    # 10    2
    # 11    2
    # 12    2
    # 13    2
    # Name: Pos, dtype: int64
    
  2. Use groupby.cumcount to create a counter within each section:

    counter = df['Pos'].groupby(section).cumcount()
    
    # 0     0
    # 1     1
    # 2     2
    # 3     3
    # 4     0
    # 5     1
    # 6     2
    # 7     3
    # 8     4
    # 9     0
    # 10    1
    # 11    2
    # 12    3
    # 13    4
    # dtype: int64
    
  3. Use where to mask everything except the repeat1 rows:

    rep1 = counter.where(df['Pos'].eq('repeat1'), 0)
    
    # 0     0
    # 1     0
    # 2     0
    # 3     0
    # 4     0
    # 5     1
    # 6     2
    # 7     3
    # 8     4
    # 9     0
    # 10    0
    # 11    0
    # 12    0
    # 13    0
    # dtype: int64
    
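  4. For the repeat2 rows, flip the counter by subtracting 5 and taking the absolute value (0,1,2,3,4 becomes 5,4,3,2,1, so the four data rows with counter 1 to 4 get 4,3,2,1), then use where again to mask everything else: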

    rep2 = counter.sub(5).abs().where(df['Pos'].eq('repeat2'), 0)
    
    # 0     0
    # 1     0
    # 2     0
    # 3     0
    # 4     0
    # 5     0
    # 6     0
    # 7     0
    # 8     0
    # 9     0
    # 10    4
    # 11    3
    # 12    2
    # 13    1
    # dtype: int64
    
  5. Column B is now rep1 + rep2, and we also use mask to replace all divider rows with empty strings:

    df['B'] = rep1.add(rep2).mask(divider, '')
    
    #         A      Pos  B
    # 0   Emo/3  repeat3  0
    # 1   Emo/4  repeat3  0
    # 2   Emo/1  repeat3  0
    # 3   Emo/3  repeat3  0
    # 4                    
    # 5   Emo/3  repeat1  1
    # 6   Emo/4  repeat1  2
    # 7   Emo/1  repeat1  3
    # 8   Emo/3  repeat1  4
    # 9                    
    # 10  Neu/5  repeat2  4
    # 11  Neu/2  repeat2  3
    # 12  Neu/5  repeat2  2
    # 13  Neu/2  repeat2  1
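
One caveat with the counter from step 2: the first section has no divider row in front of it, so its data rows are counted 0, 1, 2, 3 rather than 1, 2, 3, 4. That is harmless here because the first block is always repeat3, but it would be off by one if a repeat1 or repeat2 block came first. A possible variation, sketched below under that assumption, counts only the non-divider rows so every block gets positions 1 to 4. The name `position` and the sample Series are made up for illustration.

```python
import pandas as pd

# Sample data with repeat1 as the FIRST block, to exercise the edge case.
pos = pd.Series(["repeat1"] * 4 + [""] + ["repeat2"] * 4 + [""] + ["repeat3"] * 4)

divider = pos.eq("")
section = divider.cumsum()

# 1-based position of each data row inside its section.
# Divider rows contribute 0, so they do not shift the count.
position = (~divider).astype(int).groupby(section).cumsum()

rep1 = position.where(pos.eq("repeat1"), 0)        # 1, 2, 3, 4
rep2 = (5 - position).where(pos.eq("repeat2"), 0)  # 4, 3, 2, 1

# Cast to object first so blank strings can sit next to the integers.
b = rep1.add(rep2).astype(object).mask(divider, "")
print(b.tolist())
```

Rows labelled repeat3 fall through both `where` calls and stay at 0, the same as in the steps above.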
    

A general solution using pandas tools

OK, it took me a while to figure this out, but I wanted a slick answer, and I rather like this one:

import pandas as pd

data = {'A': ['Emo/3', 'Emo/4', 'Emo/1','Emo/3', '','Emo/3', 'Emo/4', 'Emo/1','Emo/3', '', 'Neu/5', 'Neu/2','Neu/5', 'Neu/2', '', 'Neu/5', 'Neu/2','Neu/5', 'Neu/2'],
        'Pos': ["repeat3", "repeat3", "repeat3", "repeat3", '',"repeat1", "repeat1", "repeat1", "repeat1", '', "repeat2", "repeat2","repeat2", "repeat2", '', "repeat2", "repeat2","repeat2", "repeat2"],
        }
df = pd.DataFrame(data)

# First, create column B: rows marked "repeat3" in 'Pos' become 0, all other values are copied as-is
df['B'] = df['Pos'].apply(lambda x: 0 if x == "repeat3" else x)

# Boolean mask for the rows where 'Pos' equals "repeat1"
mask1 = df['Pos'].eq("repeat1")
# Count how many 4-row blocks of type repeat1 there are
number_of_repeat1_blocks = int(mask1.sum() // 4)

# Same again for the rows where 'Pos' equals "repeat2"
mask2 = df['Pos'].eq("repeat2")
# Count how many 4-row blocks of type repeat2 there are
number_of_repeat2_blocks = int(mask2.sum() // 4)

# Define the number sequence to write in each case
# For rows matching repeat1
repl1 = [1, 2, 3, 4] * number_of_repeat1_blocks
# For rows matching repeat2
repl2 = [4, 3, 2, 1] * number_of_repeat2_blocks

# Finally, overwrite the matched rows with the sequences
df.loc[mask1, 'B'] = repl1
df.loc[mask2, 'B'] = repl2


print(df)

Result:

        A      Pos  B
0   Emo/3  repeat3  0
1   Emo/4  repeat3  0
2   Emo/1  repeat3  0
3   Emo/3  repeat3  0
4                    
5   Emo/3  repeat1  1
6   Emo/4  repeat1  2
7   Emo/1  repeat1  3
8   Emo/3  repeat1  4
9                    
10  Neu/5  repeat2  4
11  Neu/2  repeat2  3
12  Neu/5  repeat2  2
13  Neu/2  repeat2  1
14                   
15  Neu/5  repeat2  4
16  Neu/2  repeat2  3
17  Neu/5  repeat2  2
18  Neu/2  repeat2  1
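
This approach relies on every labelled block having exactly four rows; if a block is short, the replacement list and the mask will not line up. A small helper can make that assumption explicit. The function below is a hypothetical sketch (`fill_pattern` and its parameters are not part of pandas) that applies the same mask-and-tile idea and raises an error when the rows do not form complete blocks.

```python
import pandas as pd

def fill_pattern(df, label, pattern, source="Pos", target="B"):
    """Tile `pattern` over the rows of `target` where `source` equals `label`.

    Hypothetical helper: assumes each block for `label` has len(pattern) rows.
    """
    mask = df[source].eq(label)
    if not mask.any():
        return df  # nothing to fill for this label
    blocks, leftover = divmod(int(mask.sum()), len(pattern))
    if leftover:
        raise ValueError(
            f"{label!r} rows do not form complete blocks of {len(pattern)}"
        )
    df.loc[mask, target] = list(pattern) * blocks
    return df

pos = (["repeat3"] * 4 + [""] + ["repeat1"] * 4 + [""]
       + ["repeat2"] * 4 + [""] + ["repeat2"] * 4)
df = pd.DataFrame({"Pos": pos})

# Start from a blank object column so integers and "" can coexist.
df["B"] = pd.Series("", index=df.index, dtype=object)

fill_pattern(df, "repeat3", [0, 0, 0, 0])
fill_pattern(df, "repeat1", [1, 2, 3, 4])
fill_pattern(df, "repeat2", [4, 3, 2, 1])
print(df)
```

Adding a new label then only takes one more `fill_pattern` call rather than another mask and counter pair.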

Solution

I'm sure there are better ways, but here is one:

df["B"] = ""
repeat_mapping = {"repeat3": [0]*4,
                  "repeat2": [*range(4, 0, -1)],
                  "repeat1": [*range(1, 5)]}

repeats = df[::5]["Pos"].map(repeat_mapping).explode()
repeats.index += pd.Series([*range(4)]*len(df[::5]))

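# .loc assigns directly on df; the chained form df["B"][repeats.index] = ... may not update df in newer pandas
df.loc[repeats.index, "B"] = repeats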

Output:

        A      Pos  B
0   Emo/3  repeat3  0
1   Emo/4  repeat3  0
2   Emo/1  repeat3  0
3   Emo/3  repeat3  0
4
5   Emo/3  repeat1  1
6   Emo/4  repeat1  2
7   Emo/1  repeat1  3
8   Emo/3  repeat1  4
9
10  Neu/5  repeat2  4
11  Neu/2  repeat2  3
12  Neu/5  repeat2  2
13  Neu/2  repeat2  1

Steps

Prepare the new column:

In [1]: df["B"] = ""

In [2]: df
Out[2]:
        A      Pos B
0   Emo/3  repeat3
1   Emo/4  repeat3
2   Emo/1  repeat3
3   Emo/3  repeat3
4
5   Emo/3  repeat1
6   Emo/4  repeat1
7   Emo/1  repeat1
8   Emo/3  repeat1
9
10  Neu/5  repeat2
11  Neu/2  repeat2
12  Neu/5  repeat2
13  Neu/2  repeat2

Grab every fifth row:

In [3]: df[::5]["Pos"]
Out[3]:
0     repeat3
5     repeat1
10    repeat2
Name: Pos, dtype: object

Apply repeat_mapping:

In [4]: df[::5]["Pos"].map(repeat_mapping)
Out[4]:
0     [0, 0, 0, 0]
5     [1, 2, 3, 4]
10    [4, 3, 2, 1]
Name: Pos, dtype: object

Explode the lists:

In [5]: repeats = df[::5]["Pos"].map(repeat_mapping).explode()

In [6]: repeats
Out[6]:
0     0
0     0
0     0
0     0
5     1
5     2
5     3
5     4
10    4
10    3
10    2
10    1
Name: Pos, dtype: object

Note that each index in repeats is repeated 4 times. We fix this by incrementing the indices by 0, 1, 2, 3:

In [7]: pd.Series([*range(4)]*len(df[::5])).values
Out[7]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int64)

In [8]: repeats.index += pd.Series([*range(4)]*len(df[::5]))

In [9]: repeats
Out[9]:
0     0
1     0
2     0
3     0
5     1
6     2
7     3
8     4
10    4
11    3
12    2
13    1
Name: Pos, dtype: object

Finally, the assignment selects only the rows of column B whose index matches the index of repeats, and assigns the values of repeats to those rows.
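
The same every-fifth-row idea can also be expressed with explicit row positions instead of shifting the exploded index. The sketch below uses NumPy broadcasting to build the target rows. It assumes the layout stated in the question (four data rows, then one blank row, with no incomplete block at the end), and the names `starts`, `rows` and `values` are illustrative only.

```python
import numpy as np
import pandas as pd

repeat_mapping = {"repeat3": [0] * 4,
                  "repeat2": [4, 3, 2, 1],
                  "repeat1": [1, 2, 3, 4]}

df = pd.DataFrame(
    {"Pos": ["repeat3"] * 4 + [""] + ["repeat1"] * 4 + [""] + ["repeat2"] * 4}
)
# Blank object column so integers and "" can coexist.
df["B"] = pd.Series("", index=df.index, dtype=object)

# Row position where each block starts: 0, 5, 10, ...
starts = np.arange(0, len(df), 5)
labels = df["Pos"].to_numpy()[starts]

# Broadcasting: every start plus offsets 0..3 gives the rows of each block.
rows = (starts[:, None] + np.arange(4)).ravel()
values = np.concatenate([repeat_mapping[label] for label in labels])

df.loc[df.index[rows], "B"] = values
print(df)
```

Because the rows are computed as positions and then translated through `df.index`, this variation does not depend on the DataFrame having a default 0..n-1 index.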
