Complex string filtering with Python

Posted 2024-09-27 00:18:30


I have a very long string, which is a phylogenetic tree, and I want to apply a very specific filter to it:

(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;

Basically, every x@y is a species@gene_id pair. What I'm trying to do is reduce the tree so that I'm left with only x instead of x@y:

(Esy, Aar,(Spa,Cpl))...

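I first tried splitting the string, but the problem is that it has different "split points" for what I want to achieve: some x@y parts end with , while others end with ). I searched for a solution and came across regular expression operations, but I'm new to Python and couldn't tell whether that is what I should be looking at. I also considered strip(), but it seems I would need to specify the characters to strip.

The main problem is that I haven't given Python any "pattern" to follow. The only regularity is that all species IDs are 3 letters long and come right before the @ character.

Is there a way to do what I need? I'd be very glad if you could help me solve this. Thanks in advance.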


3 answers

How about a function like this:

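def parse_string(string):
    new_string = ''
    skip = False  # True while inside the part to drop (gene ID and branch length)
    for char in string:
        if char == '@':
            skip = True   # everything from '@' onward is dropped...
        if char == ',':
            skip = False  # ...until the next comma starts a new entry
        # Parentheses are always kept so the tree structure survives
        if not skip or char in ['(', ')']:
            new_string += char
    return new_string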

Call it with the string:

string = '(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
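A function like this is easier to check on a much smaller tree, where the expected result can be verified by eye. The sketch below restates the same character-scanning idea, collecting the kept characters in a list and joining once at the end. The name strip_gene_ids and the toy tree are made up for illustration and are not part of the original answer.

```python
def strip_gene_ids(newick):
    """Keep species IDs, commas and parentheses; drop '@gene:length' parts."""
    kept = []
    skipping = False
    for char in newick:
        if char == '@':
            skipping = True       # start of the gene ID: stop copying
        elif char == ',':
            skipping = False      # a comma starts the next entry: resume copying
        if not skipping or char in '()':
            kept.append(char)     # parentheses are always copied
    return ''.join(kept)


# Hypothetical toy tree in the same species@gene:length format
toy = "(Esy@g1:0.07,(Spa@g2:0.03,Cpl@g3:0.02):0.01);"
print(strip_gene_ids(toy))  # (Esy,(Spa,Cpl))
```

As with the original function, the trailing ; is dropped because scanning is still in "skip" mode when it is reached.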

Try this:

import re

pat = re.compile(r'(\w{3})@')
txt = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(txt)

Result:

['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']

If you need the full structure, we can instead remove the unnecessary parts:

pat = re.compile(r'(@|:)[^),]*')
pat.sub('', txt).replace(',', ', ')

Result:

'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
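If the branch lengths should be kept and only the gene IDs removed, the same substitution idea can be narrowed so that it stops at the colon. This is a sketch of a variant, not something the question asked for; the function name drop_gene_ids and the toy tree are assumptions made for illustration.

```python
import re


def drop_gene_ids(newick):
    # Remove '@' and everything after it, stopping before the next ':', ',' or ')'.
    # Branch lengths (':0.07') and the final ';' are left untouched.
    return re.sub(r'@[^:,)]*', '', newick)


# Hypothetical toy tree in the same species@gene:length format
toy = "(Esy@g1:0.07,(Spa@g2:0.03,Cpl@g3:0.02):0.01);"
print(drop_gene_ids(toy))  # (Esy:0.07,(Spa:0.03,Cpl:0.02):0.01);
```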


You can use a regular expression:

import re 
s = "(Esy@ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar@AA_maker7399_1:0.137507902808,((Spa@Tp2g18720:0.0318934795022,Cpl@CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst@Bostr_13083s0053_1:0.0332592496158,((Aly@AL8G21130_t1:0.0328569260951,Ath@AT5G48370_1:0.0391706378372):0.0205924636564,(Chi@CARHR183840_1:0.0954469923893,Cru@Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco@scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo@DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla@DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse@DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa@Thhalv10004228m:0.0378509854703,Aal@Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = r"...?(?=@)|\(|\)"

result = re.findall(p, s)

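You get the result as a list, so you can join it into a string or do whatever else you need with it. Note that the list holds only the species IDs and the parentheses; the commas are not captured.

Explanation of what is happening:
p is the regular expression pattern.
In this pattern:
. matches any single character (except a newline)
...?(?=@) matches two characters plus an optional third (.?), and the lookahead (?=@) requires that the next character is @ without consuming it. On this input it therefore finds the three characters immediately before each @
| is the "or" operator; I use it here to look for another pattern
The rest, \(|\), matches the parentheses ( and )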
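To turn the matches back into a tree string, the commas have to be captured as well. The sketch below is a variant of the pattern explained above, under the assumption (taken from the question) that every species ID is exactly three ASCII letters; the function name species_topology and the toy tree are hypothetical.

```python
import re

# Three letters directly followed by '@' (lookahead, '@' is not consumed),
# or a single structural character: '(', ')' or ','.
TOKEN = re.compile(r"[A-Za-z]{3}(?=@)|[(),]")


def species_topology(newick):
    # findall returns the tokens in order; joining them rebuilds the tree
    # without gene IDs, branch lengths or the trailing ';'.
    return "".join(TOKEN.findall(newick))


# Hypothetical toy tree in the same species@gene:length format
toy = "(Esy@g1:0.07,(Spa@g2:0.03,Cpl@g3:0.02):0.01,Aar@g4:0.1);"
print(TOKEN.findall(toy))
print(species_topology(toy))  # (Esy,(Spa,Cpl),Aar)
```

Using [A-Za-z]{3} instead of ...? is stricter: it will not pick up a parenthesis or comma as part of an ID if an ID ever turns out to be shorter than three characters.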
