How to split a string into several phrases in Python

Published 2024-06-28 19:53:55


For example:

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'

How can I split str1 into 'Polyphosphate', 'H2O', 'Oligophosphate', and str2 into 'ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP' and 'Oxidized ferredoxin'? Thanks, everyone!


3 answers

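Using a regular expression, you can split on <=> and + to get the individual compounds, each still carrying its leading number.

Once they are separated, use lstrip to remove the leading coefficient (digits, n, (n+1) and so on) and strip to remove any remaining whitespace.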

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'

# "0" is included in the stripped set so that coefficients such as 10 are removed in full
res1 = [i.lstrip(" 0123456789n()+").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [i.lstrip(" 0123456789n()+").strip() for i in re.split(r" \+ | <=> ", str2)]

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
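
One caveat is worth seeing in action: lstrip treats its argument as a set of characters, not as a prefix, so it keeps stripping until it reaches a character outside that set. The sketch below wraps the approach in a helper (split_compounds and COEFFICIENT_CHARS are names chosen here for illustration, not part of the original answer) and shows what happens to compound names that themselves begin with a digit or a parenthesis.

```python
import re

# Characters stripped from the left of each part (assumed coefficient characters).
COEFFICIENT_CHARS = " 0123456789n()+"


def split_compounds(equation):
    """Split on ' + ' and ' <=> ', then strip leading coefficient characters."""
    parts = re.split(r" \+ | <=> ", equation)
    return [part.lstrip(COEFFICIENT_CHARS).strip() for part in parts]


print(split_compounds("Polyphosphate + n H2O <=> (n+1) Oligophosphate"))
# ['Polyphosphate', 'H2O', 'Oligophosphate']

# Names that start with characters from the set lose them as well:
print(split_compounds("5-Aminolevulinate + (+)-Bisdechlorogeodin"))
# ['-Aminolevulinate', '-Bisdechlorogeodin']
```

So this version is fine for the two strings in the question, but not for names that start with a digit, n, a parenthesis or a plus sign.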

Given your changed requirements:

Some compound names may themselves contain digits or other characters, for example '5-Aminolevulinate' or '(+)-Bisdechlorogeodin'

Here is another, slightly less elegant solution, with an additional, more complex example:

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'
str3 = '5-Aminolevulinate + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + (+)-Bisdechlorogeodin + (n+1) Oligophosphate'

res1 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str2)]
res3 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str3)]

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
print(res3) # ['5-Aminolevulinate', 'Reduced ferredoxin', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', 'Oligophosphate']

To address your now-deleted comment, and to cover possible further requirements:

During the experiment, new compounds appear, for example "2 GTP <=> Diphosphate + P1,P4-Bis(5'-guanosyl) tetraphosphate", where the compound is "P1,P4-Bis(5'-guanosyl) tetraphosphate"

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'
str3 = '5-Aminolevulinate + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + (+)-Bisdechlorogeodin + (n+1) Oligophosphate'
str4 = '2 GTP <=> Diphosphate + 8 e- + 16 Orthophosphate + 12 (+)-Bisdechlorogeodin + (n+1) P1,P4-Bis(5\'-guanosyl) tetraphosphate'

res1 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str2)]
res3 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str3)]
res4 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str4)]

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
print(res3) # ['5-Aminolevulinate', 'Reduced ferredoxin', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', 'Oligophosphate']
print(res4) # ['GTP', 'Diphosphate', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', "P1,P4-Bis(5'-guanosyl) tetraphosphate"]

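(Note: I added some arbitrary extra terms to the equations to check that the result is correct in more situations. I have not necessarily caught every edge case, but it works for the given examples.)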
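
A different way to express the same idea, offered only as a sketch and not as the method used above, is to describe the coefficient explicitly and remove it only when it is followed by whitespace. The names COEFFICIENT and extract_compounds are hypothetical, and the pattern assumes coefficients look like 16, n or (n+1).

```python
import re

# A leading coefficient: digits, a bare "n", or "(n+1)" / "(n-1)",
# but only when followed by whitespace (so "5-Aminolevulinate" is left alone).
COEFFICIENT = re.compile(r"^(?:\d+|n|\(n[+-]\d+\))\s+")


def extract_compounds(equation):
    """Return the compound names of a reaction equation, without coefficients."""
    parts = re.split(r" \+ | <=> ", equation)
    return [COEFFICIENT.sub("", part.strip()) for part in parts]


str3 = '5-Aminolevulinate + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + (+)-Bisdechlorogeodin + (n+1) Oligophosphate'
str4 = '2 GTP <=> Diphosphate + 8 e- + 16 Orthophosphate + 12 (+)-Bisdechlorogeodin + (n+1) P1,P4-Bis(5\'-guanosyl) tetraphosphate'

print(extract_compounds(str3))
# ['5-Aminolevulinate', 'Reduced ferredoxin', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', 'Oligophosphate']
print(extract_compounds(str4))
# ['GTP', 'Diphosphate', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', "P1,P4-Bis(5'-guanosyl) tetraphosphate"]
```

Because the coefficient must be followed by whitespace, a name that merely starts with n or a digit is not truncated. Coefficient forms outside the assumed three would need the pattern extended.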

As you describe in the question, just use the string split method:

phrases = str.split(" + ")

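It returns a list of the substrings obtained by cutting the string at the separator passed as the argument. If no argument (or None) is passed, it splits on runs of whitespace (spaces, tabs or newlines).

The inverse string method, join, can be used later to reassemble a string from such a list: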

my_str = " + ".join(phrases)
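
As a small illustration of the difference between the two forms of split, and of join as the inverse, here is a sketch using a shortened equation (the variable names are chosen for this example only):

```python
equation = "16 ATP + 16 H2O"

by_separator = equation.split(" + ")  # cut only at the exact separator
by_whitespace = equation.split()      # cut at any run of whitespace
rebuilt = " + ".join(by_separator)    # join reverses the separator-based split

print(by_separator)   # ['16 ATP', '16 H2O']
print(by_whitespace)  # ['16', 'ATP', '+', '16', 'H2O']
print(rebuilt)        # 16 ATP + 16 H2O
```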

As a side note, avoid naming a variable str: it is the name of Python's built-in string class, and the variable would shadow it.
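
To make the shadowing concrete, here is a minimal sketch (shadowing_demo is a hypothetical name). It is kept inside a function so that the built-in is hidden only locally:

```python
def shadowing_demo():
    str = "16 ATP + 16 H2O"  # local name now hides the built-in str class
    try:
        str(42)  # this calls a string instance, not the class
    except TypeError as exc:
        return type(exc).__name__
    return None


print(shadowing_demo())  # TypeError
print(str(42))           # 42 (outside the function the built-in is intact)
```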

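To process each substring further, you can use a for loop, apply split again to each phrase, and drop the tokens you are not interested in. If those are only numbers and the <=> token, the following will work. Note that because the string is split only on " + ", the phrase that contains <=> still holds the compounds from both sides of it, and non-numeric coefficients such as n or (n+1) are not removed: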

# str is the input string here (but see the naming note above)
raw_phrases = str.split(" + ")
phrases = []
for phrase in raw_phrases:
    filtered = []
    for token in phrase.split():
        # extend the predicate to whatever other tokens you want to filter
        if not token.isdigit() and token not in ("<=>",):
            filtered.append(token)
    phrases.append(" ".join(filtered))

The split, filter and rejoin loop above can be expressed in a single line with a list comprehension:

phrases = [" ".join(token for token in phrase.split(" ") if not token.isdigit() and token not in ("<=>",)) for phrase in str.split(" + ")]
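
Splitting only on " + " leaves the two compounds around <=> in the same phrase. One way to stay with plain string methods, sketched here under the assumption that coefficients are digits, n or (n+1), is to turn the arrow into an ordinary separator first and then drop only a leading coefficient token. The function name split_phrases is chosen for this example.

```python
def split_phrases(equation):
    """Split a reaction equation into compound names using only str methods."""
    # Treat the reaction arrow as just another separator.
    raw_phrases = equation.replace(" <=> ", " + ").split(" + ")
    phrases = []
    for phrase in raw_phrases:
        tokens = phrase.split()
        # Drop a leading coefficient token (assumed forms: digits, "n", "(n+1)").
        if len(tokens) > 1 and (tokens[0].isdigit() or tokens[0] in ("n", "(n+1)")):
            tokens = tokens[1:]
        phrases.append(" ".join(tokens))
    return phrases


str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'

print(split_phrases(str1))
# ['Polyphosphate', 'H2O', 'Oligophosphate']
print(split_phrases(str2))
# ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
```

Only the first token is ever removed, so digits inside a name such as H2O or 5-Aminolevulinate are untouched.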

You can do it like this:

phrases = str.split(" + ")

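This splits on " + ", the common separator between the phrases.

FYI: str is already used by Python for the built-in string type. I suggest you pick a different variable name.

Update:

As noted in a comment on another answer, this leaves an 8 in front of some phrases. You can remove them by looping over phrases: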

phrases = [phrase.strip("8 ") for phrase in phrases]
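
Keep in mind that strip("8 ") removes only the characters 8 and space, and does so from both ends, so other coefficients such as 16 survive. A slightly more general sketch, assuming the coefficient is a purely numeric first word, uses partition (drop_leading_number is a hypothetical helper name):

```python
def drop_leading_number(phrase):
    """Remove the first word of a phrase if it consists only of digits."""
    head, sep, tail = phrase.partition(" ")
    return tail if sep and head.isdigit() else phrase


print("8 Reduced ferredoxin".strip("8 "))  # Reduced ferredoxin
print("16 ATP".strip("8 "))                # 16 ATP (the 1 stops the stripping)

print(drop_leading_number("16 ATP"))             # ATP
print(drop_leading_number("5-Aminolevulinate"))  # 5-Aminolevulinate
```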
