How to parse column values with a regular expression to extract strings as int

        name               Count_Relationship
0    allicin                  DOWNREGULATE: 1
1    allicin                  DOWNREGULATE: 2
2    allicin  UPREGULATE: 1 | DOWNREGULATE: 1
3    aspirin  UPREGULATE: 5 | DOWNREGULATE: 1
4  albuterol                  DOWNREGULATE: 1
5  albuterol                    UPREGULATE: 3

import pandas as pd

data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
        'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2',
                               'UPREGULATE: 1 | DOWNREGULATE: 1',
                               'UPREGULATE: 5 | DOWNREGULATE: 1',
                               'DOWNREGULATE: 1', 'UPREGULATE: 3']
        }

df = pd.DataFrame(data)

substances = df["name"].tolist()
substances = list(set(substances))  # to get the unique names
result_substances = []

for substance in (substances):
    try:
        numberOfdownregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    try:
        numberOfupregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    result = numberOfdownregulate - numberOfupregulate
    if result > 0:
        result_substances.append(substance)

df_filtered = df[df["name"].isin(result_substances)]

2 Answers

User

#1 · edited 2024-10-05 14:27:25

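I suggest extracting the DOWNREGULATE and UPREGULATE values into separate columns, then summing each column per name with a groupby and checking which total is larger.

The example below adds a boolean column named UP_gt_DOWN, which literally means "UPREGULATE greater than DOWNREGULATE":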

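# \d+ requires at least one digit; rows without a match become NaN and are filled with 0
df['UPREGULATE'] = df['Count_Relationship'].str.extract(r"UPREGULATE: (\d+)", expand=False).fillna(0).astype(int)
df['DOWNREGULATE'] = df['Count_Relationship'].str.extract(r"DOWNREGULATE: (\d+)", expand=False).fillna(0).astype(int)

# Select the two count columns explicitly so the string column is left out of the sum
summed_df = df.groupby('name')[['UPREGULATE', 'DOWNREGULATE']].sum()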
summed_df['UP_gt_DOWN'] = summed_df['UPREGULATE'] > summed_df['DOWNREGULATE']
print(summed_df)

# Output
#            UPREGULATE  DOWNREGULATE  UP_gt_DOWN
# name                                           
# albuterol           3             1        True
# allicin             1             4       False
# aspirin             5             1        True

filtered_drugs = summed_df[~summed_df['UP_gt_DOWN']].index
print(df[df['name'].isin(filtered_drugs)])

# Output
#       name               Count_Relationship  UPREGULATE  DOWNREGULATE
# 0  allicin                  DOWNREGULATE: 1           0             1
# 1  allicin                  DOWNREGULATE: 2           0             2
# 2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1           1             1
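
The filter above keeps names where the totals are tied, whereas the loop in the question only keeps a substance when `result > 0`. As a minimal sketch of the same extract-then-sum idea with that stricter comparison, the steps could be wrapped in a function. The names `count_for` and `filter_net_downregulated` are hypothetical and introduced only for illustration; the data is the sample from the question.

```python
import pandas as pd

data = {
    "name": ["allicin", "allicin", "allicin", "aspirin", "albuterol", "albuterol"],
    "Count_Relationship": [
        "DOWNREGULATE: 1",
        "DOWNREGULATE: 2",
        "UPREGULATE: 1 | DOWNREGULATE: 1",
        "UPREGULATE: 5 | DOWNREGULATE: 1",
        "DOWNREGULATE: 1",
        "UPREGULATE: 3",
    ],
}
df = pd.DataFrame(data)


def count_for(column: pd.Series, label: str) -> pd.Series:
    # Hypothetical helper: take the integer after "<label>: ", or 0 when the label is absent
    extracted = column.str.extract(rf"{label}: (\d+)", expand=False)
    return pd.to_numeric(extracted).fillna(0).astype(int)


def filter_net_downregulated(frame: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical helper: keep rows whose name has total down strictly above total up
    counts = pd.DataFrame({
        "up": count_for(frame["Count_Relationship"], "UPREGULATE"),
        "down": count_for(frame["Count_Relationship"], "DOWNREGULATE"),
    })
    totals = counts.groupby(frame["name"]).sum()
    keep = totals.index[totals["down"] > totals["up"]]
    return frame[frame["name"].isin(keep)]


result = filter_net_downregulated(df)
print(result)
```

On this sample both comparisons select only allicin, since no name has equal totals; they would differ only for a name whose UPREGULATE and DOWNREGULATE sums match.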

User

#2 · edited 2024-10-05 14:27:25

You can extract the counts, compare down against up, and build a mask to select the rows:

drugs = (df.join(df['Count_Relationship'].str.extractall(r'(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)')
                   .groupby(level=0).first().fillna(0).astype(int)
                 )
           .groupby('name').agg({'down': 'sum', 'up': 'sum'})
           .query('down >= up')
           .index
        )

df[df['name'].isin(drugs)]

Output:

      name               Count_Relationship
0  allicin                  DOWNREGULATE: 1
1  allicin                  DOWNREGULATE: 2
2  allicin  UPREGULATE: 1 | DOWNREGULATE: 1
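
The chained expression above does several things at once, so here is a small sketch that splits the extraction into steps on two of the sample strings. The variable names `s`, `matches`, and `per_row` are illustrative only, and `pd.to_numeric` is used here in place of the direct `astype(int)` as an assumption about what is most portable across pandas versions.

```python
import pandas as pd

s = pd.Series(["DOWNREGULATE: 2", "UPREGULATE: 1 | DOWNREGULATE: 1"])
pattern = r"(?P<down>(?<=DOWNREGULATE: )\d+)|(?P<up>(?<=UPREGULATE: )\d+)"

# extractall returns one row per match, indexed by (original row, match number);
# each match fills only one of the two named groups, the other stays NaN
matches = s.str.extractall(pattern)

# first() takes the first non-null value per column within each original row,
# collapsing the matches back to one row per input string
per_row = matches.groupby(level=0).first()

# Convert the captured digit strings to integers, using 0 where a label was absent
per_row = per_row.apply(pd.to_numeric).fillna(0).astype(int)
print(per_row)
```

Note that a string with no match at all produces no row in the `extractall` result, so after the `join` such rows would carry NaN in `down` and `up`.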
