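How to split a string into several phrases in Python

3 answers

User

Answer 1 · edited 2024-06-28 19:53:55

With a regular expression you can split on <=> or +, which gives you the individual compounds, each still carrying its numeric coefficient.

Once they are separated, use lstrip to remove the leading coefficient (including forms such as (n+1)) and strip to remove any trailing whitespace.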

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'

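# "0" is included in the character set so that coefficients such as 10 or 20 are removed completely
res1 = [i.lstrip(" 0123456789n()+").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [i.lstrip(" 0123456789n()+").strip() for i in re.split(r" \+ | <=> ", str2)]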

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
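
A note on why lstrip works here: its argument is a set of characters, not a literal prefix, so it keeps removing leading characters until it meets one that is not in the set. The short sketch below uses the character set from the snippet above with "0" included (the sample strings are only illustrative). It also shows the weak spot: a compound name that itself starts with one of those characters gets damaged.

```python
# lstrip() treats its argument as a set of characters, not as a literal prefix.
chars = " 0123456789n()+"

print("(n+1) Oligophosphate".lstrip(chars))   # Oligophosphate
print("16 ATP".lstrip(chars))                 # ATP

# Limitation: a name that begins with a digit or a bracket is stripped too.
print("5-Aminolevulinate".lstrip(chars))      # -Aminolevulinate
print("(+)-Bisdechlorogeodin".lstrip(chars))  # -Bisdechlorogeodin
```

So this version is fine as long as no compound name begins with a digit, a bracket, "+" or a lowercase "n".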

Following your changed requirements:

In some compound, it may also exist the number or some other char, for example, '5-Aminolevulinate' or '(+)-Bisdechlorogeodin'

Here is another, slightly less tidy solution, with an additional, more complicated example:

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'
str3 = '5-Aminolevulinate + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + (+)-Bisdechlorogeodin + (n+1) Oligophosphate'

res1 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str2)]
res3 = [re.split(r"[^a-z] ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str3)]

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
print(res3) # ['5-Aminolevulinate', 'Reduced ferredoxin', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', 'Oligophosphate']

To address your now-deleted comment, and to cover further possible requirements:

During the experiment, there exist new compounds, for example ''2 GTP <=> Diphosphate + P1,P4-Bis(5'-guanosyl) tetraphosphate'', the compound is 'P1,P4-Bis(5'-guanosyl) tetraphosphate'

import re

str1 = 'Polyphosphate + n H2O <=> (n+1) Oligophosphate'
str2 = '16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + 16 ADP + 8 Oxidized ferredoxin'
str3 = '5-Aminolevulinate + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate + (+)-Bisdechlorogeodin + (n+1) Oligophosphate'
str4 = '2 GTP <=> Diphosphate + 8 e- + 16 Orthophosphate + 12 (+)-Bisdechlorogeodin + (n+1) P1,P4-Bis(5\'-guanosyl) tetraphosphate'

res1 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str1)]
res2 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str2)]
res3 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str3)]
res4 = [re.split(r"[^a-z\)]\)? ", i)[-1].lstrip("n ").strip() for i in re.split(r" \+ | <=> ", str4)]

print(res1) # ['Polyphosphate', 'H2O', 'Oligophosphate']
print(res2) # ['ATP', 'H2O', 'Reduced ferredoxin', 'e-', 'Orthophosphate', 'ADP', 'Oxidized ferredoxin']
print(res3) # ['5-Aminolevulinate', 'Reduced ferredoxin', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', 'Oligophosphate']
print(res4) # ['GTP', 'Diphosphate', 'e-', 'Orthophosphate', '(+)-Bisdechlorogeodin', "P1,P4-Bis(5'-guanosyl) tetraphosphate"]

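(Note: I added some arbitrary extra terms to the equations to try to make sure the result is correct in more cases. I have not necessarily caught every edge case, but it works for the examples given.)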
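
If the edge cases keep growing, one alternative is to stop guessing where the name starts and instead remove only something that looks like a coefficient at the very start of each part. The following is a minimal sketch of that idea, not the method used above. The helper name compounds and the COEFFICIENT pattern are made up for illustration, and the pattern assumes coefficients are either a plain integer, a bare n, or a bracketed form such as (n+1).

```python
import re

# Hypothetical helper, not part of the answer above: remove only a leading
# coefficient such as "16", "n" or "(n+1)" when it is followed by whitespace.
COEFFICIENT = re.compile(r"^(?:\d+|n|\(n[+-]\d+\))\s+")

def compounds(equation):
    """Return the compound names in a reaction equation, without coefficients."""
    # Split only where "+" or "<=>" is surrounded by whitespace, so that
    # "(+)-Bisdechlorogeodin" and "(n+1)" are left in one piece.
    parts = re.split(r"\s+(?:\+|<=>)\s+", equation)
    return [COEFFICIENT.sub("", part.strip()) for part in parts]

eq = "2 GTP <=> Diphosphate + 12 (+)-Bisdechlorogeodin + (n+1) P1,P4-Bis(5'-guanosyl) tetraphosphate"
print(compounds(eq))
# ['GTP', 'Diphosphate', '(+)-Bisdechlorogeodin', "P1,P4-Bis(5'-guanosyl) tetraphosphate"]
```

Because the pattern is anchored at the start and requires whitespace after the coefficient, names such as 5-Aminolevulinate are left alone. Coefficients in other forms (for example 2n or 0.5) would need the pattern extended.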

User

Answer 2 · edited 2024-06-28 19:53:55

As you said in the question, just use the string split method:

phrases = str.split(" + ")

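It returns a list of the substrings obtained by cutting the string at the separator passed as the argument. If no argument (or None) is passed, it splits the string on whitespace (spaces, tabs, or newlines).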
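
To make the difference between the two calling styles concrete, here is a small sketch. The sample string is made up, and the double space after H2O is deliberate.

```python
text = "16 ATP + 16 H2O  + 8 e-"   # note the double space after H2O

# With an explicit separator, the string is cut exactly at that separator.
print(text.split(" + "))   # ['16 ATP', '16 H2O ', '8 e-']

# With no argument, any run of whitespace counts as a single separator.
print(text.split())        # ['16', 'ATP', '+', '16', 'H2O', '+', '8', 'e-']
```

Note that the explicit separator leaves the stray space on '16 H2O ', so a strip() on each piece is still useful.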

The inverse string method, join, can be used later to reassemble a string from such a list:

my_str = " + ".join(phrases)

As a side note, avoid naming a variable str: it is the name of Python's built-in string class, and the variable would shadow it.
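
For anyone wondering what shadowing looks like in practice, here is a minimal sketch. The function name shadowed is made up for the demonstration, and the shadowing is kept inside the function so the built-in stays usable afterwards.

```python
def shadowed():
    str = "16 ATP + 16 H2O"   # local name shadows the built-in str type
    try:
        return str(42)        # calls the string object, not the type
    except TypeError as exc:
        return f"TypeError: {exc}"

result = shadowed()
print(result)    # TypeError: 'str' object is not callable
print(str(42))   # 42 (the built-in is untouched outside the function)
```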

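To process each substring further, you can use a for loop, split each phrase again into tokens, and drop the tokens you are not interested in. If those are just the numbers and the <=> marker, this will work: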

raw_phrases = str.split(" + ")  # str is the asker's variable holding the input string
phrases = []
for phrase in raw_phrases:
    filtered = []
    for token in phrase.split():
        # extend the predicate to whatever other tokens you want to filter
        if not token.isdigit() and token not in ("<=>",):
            filtered.append(token)
    phrases.append(" ".join(filtered))

The split, filter, and rejoin loop above can be written in a single line as a list comprehension:

phrases = [" ".join(token for token in phrase.split(" ") if not token.isdigit() and token not in ("<=>",)) for phrase in str.split(" + ")]
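
Here is the same filter run on a concrete input, as a sketch. The input is a shortened version of one of the equations from the first answer, and the variable is called equation rather than str (both choices are mine). The result shows one thing to be aware of: since only " + " is used as the separator, the compounds on either side of <=> end up in the same phrase.

```python
equation = "16 ATP + 16 H2O + 8 Reduced ferredoxin <=> 8 e- + 16 Orthophosphate"

phrases = [
    " ".join(
        token
        for token in phrase.split()
        if not token.isdigit() and token != "<=>"
    )
    for phrase in equation.split(" + ")
]
print(phrases)
# ['ATP', 'H2O', 'Reduced ferredoxin e-', 'Orthophosphate']
```

If the two sides need to stay separate, split on <=> first and apply the filter to each side.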

User

Answer 3 · edited 2024-06-28 19:53:55

You can do it like this:

phrases = str.split(" + ")

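This splits on " + ", the separator common to all the phrases.

FYI: str is already used by Python as the name of the built-in string type. I suggest choosing a different variable name.

Update:

As noted in a comment on another answer, this leaves an 8 in front of some of the phrases. You can remove them by looping over phrases: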

phrases = [phrase.strip("8 ") for phrase in phrases]
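
A quick sketch of what that strip call does and does not remove. The sample list is made up, and the second variant assumes that coefficients other than 8 may turn up.

```python
phrases = ["8 Reduced ferredoxin", "16 ATP", "8 e-"]

# strip("8 ") removes any run of '8' and ' ' characters from both ends.
print([p.strip("8 ") for p in phrases])
# ['Reduced ferredoxin', '16 ATP', 'e-']

# Assumption: other coefficients may occur, so strip any leading digits instead.
print([p.lstrip("0123456789 ") for p in phrases])
# ['Reduced ferredoxin', 'ATP', 'e-']
```

The second variant would also eat the leading digit of a name such as 5-Aminolevulinate, so it only suits inputs where names never start with a digit.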
