I have a df with two columns:
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1
3 aspirin UPREGULATE: 5 | DOWNREGULATE: 1
4 albuterol DOWNREGULATE: 1
5 albuterol UPREGULATE: 3
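I want to keep only the rows where, if I group by "name" and add up the counts in the "Count_Relationship" column, the number of downregulations is greater than the number of upregulations. In this case, allicin has DOWNREGULATE 1+2+1=4 and UPREGULATE=1, so num_downregulate > num_upregulate, which is not the case for the other drugs (aspirin, albuterol). I want to return this filtered df: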
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1
The Count_Relationship column is a string, so I have to parse the numeric part of the string and convert it to int.
I tried this:
import pandas as pd
data = {'name': ['allicin', 'allicin', 'allicin', 'aspirin', 'albuterol', 'albuterol'],
'Count_Relationship': ['DOWNREGULATE: 1', 'DOWNREGULATE: 2', 'UPREGULATE: 1 | DOWNREGULATE: 1', 'UPREGULATE: 5 | DOWNREGULATE: 1', 'DOWNREGULATE: 1' , 'UPREGULATE: 3']
}
df = pd.DataFrame(data)
substances = df["name"].tolist()
substances = list(set(substances)) # to get the unique names
result_substances = []
for substance in (substances):
    try:
        numberOfdownregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("DOWNREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    try:
        numberOfupregulate = df[(df["name"] == substance) & (\
        (df["Count_Relationship"].str.match(pat = '("UPREGULATE:"([0-9]))')).values[0].astype(int)
    except:
        pass
    result = numberOfdownregulate - numberOfupregulate
    if result > 0:
        result_substances.append(substance)
df_filtered = df[df["name"].isin(result_substances)]
But I get a syntax error on the numberOfdownregulate line, where the regular expression is. How can I fix the algorithm? Many thanks.
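I suggest extracting the downregulation and upregulation values into separate columns, then summing them grouped by name and checking which total is larger.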
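A minimal sketch of that suggestion, assuming each label appears at most once per row and is always followed by a colon and an integer. The column names `DOWN` and `UP` and the variables `counts`, `totals` and `keep` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["allicin", "allicin", "allicin", "aspirin", "albuterol", "albuterol"],
    "Count_Relationship": [
        "DOWNREGULATE: 1", "DOWNREGULATE: 2", "UPREGULATE: 1 | DOWNREGULATE: 1",
        "UPREGULATE: 5 | DOWNREGULATE: 1", "DOWNREGULATE: 1", "UPREGULATE: 3",
    ],
})

# Extract each count into its own column (hypothetical names DOWN / UP).
# str.extract yields NaN where the label is absent, so fill those with 0.
col = df["Count_Relationship"]
counts = pd.DataFrame({
    "DOWN": col.str.extract(r"\bDOWNREGULATE:\s*(\d+)", expand=False),
    "UP": col.str.extract(r"\bUPREGULATE:\s*(\d+)", expand=False),
}).apply(pd.to_numeric).fillna(0).astype(int)

# Sum the counts per name, then keep names where DOWN exceeds UP.
totals = counts.groupby(df["name"]).sum()
keep = totals.index[totals["DOWN"] > totals["UP"]]

df_filtered = df[df["name"].isin(keep)]
print(df_filtered)
```

Aggregating first and filtering with `isin` keeps the original frame untouched; the parsed counts live only in the temporary `counts` frame.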
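The example below creates an additional boolean column named UP_gt_DOWN, which is literally "upregulation greater than downregulation". You can extract the counts, compare up against down, and build a mask to select the data: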
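The code for this answer did not survive on this page, so the following is only a sketch of the idea rather than the answerer's exact code. It assumes the same label format as in the question; `count_of`, `up_total`, `down_total` and `mask` are hypothetical names:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["allicin", "allicin", "allicin", "aspirin", "albuterol", "albuterol"],
    "Count_Relationship": [
        "DOWNREGULATE: 1", "DOWNREGULATE: 2", "UPREGULATE: 1 | DOWNREGULATE: 1",
        "UPREGULATE: 5 | DOWNREGULATE: 1", "DOWNREGULATE: 1", "UPREGULATE: 3",
    ],
})

def count_of(label):
    """Hypothetical helper: integer after `label:` in each row, 0 if absent."""
    found = df["Count_Relationship"].str.extract(rf"\b{label}:\s*(\d+)", expand=False)
    return pd.to_numeric(found).fillna(0).astype(int)

# Per-row totals for the row's name group (transform keeps the original index).
up_total = count_of("UPREGULATE").groupby(df["name"]).transform("sum")
down_total = count_of("DOWNREGULATE").groupby(df["name"]).transform("sum")

# Boolean column as described in the answer: True where up exceeds down.
df["UP_gt_DOWN"] = up_total > down_total

# The question asks for down strictly greater than up, so build that mask directly
# (negating UP_gt_DOWN would also keep ties).
mask = down_total > up_total
df_filtered = df.loc[mask, ["name", "Count_Relationship"]]
print(df_filtered)
```

Using `transform("sum")` rather than an aggregation gives one total per original row, so the comparison can serve directly as a row mask.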
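Output:
name Count_Relationship
0 allicin DOWNREGULATE: 1
1 allicin DOWNREGULATE: 2
2 allicin UPREGULATE: 1 | DOWNREGULATE: 1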