Extracting the unique bracketed parts of strings in a text file

Posted 2024-10-02 18:17:58


I want to extract the data enclosed in square brackets and write it to another text file.

My text file is

RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]
PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]
PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656]
PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572]
PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]

I used this program

infile = open('out3.txt', 'r')
outfile = open('out5.txt', 'w')
for l in infile:
    outfile.write(l.split()[-1] + '\n')
infile.close()
outfile.close()

But it does not work.


3 answers

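Here is a regular-expression solution that works and keeps the [ ]. The regular expression is r'(\[.+\])'.

The leading r marks a raw string, so Python does not treat the backslashes as escape characters.

The outer parentheses ( ) form a capture group, whose match ends up in the tuple returned by m.groups().

The [ and ] must be escaped (\[ and \]) because they are regular-expression metacharacters.

.+ means one or more (+) of any character (.).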
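To make the pieces of that pattern concrete, here is a minimal sketch run on one line of the sample data. The variable names and the two-bracket test string are made up for illustration. It also shows a caveat: `.+` is greedy, so on a line with more than one bracketed part it would run through to the last `]`, whereas a negated character class stops at the first one.

```python
import re

line = "RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]"

# Capture group around the escaped brackets: the brackets are kept.
m = re.search(r'(\[.+\])', line)
with_brackets = m.groups()[0]

# Capture group inside the escaped brackets: the brackets are dropped.
m = re.search(r'\[(.+)\]', line)
without_brackets = m.group(1)

# Hypothetical line with two bracketed parts, to show greedy matching.
two = "a [first] b [second]"
greedy = re.search(r'\[(.+)\]', two).group(1)

# [^\]]+ matches anything except a closing bracket, so it stops early.
first_only = re.search(r'\[([^\]]+)\]', two).group(1)

print(with_brackets)     # [Aspergillus aculeatinus CBS 121060]
print(without_brackets)  # Aspergillus aculeatinus CBS 121060
print(greedy)            # first] b [second
print(first_only)        # first
```

For the file in the question each line has a single bracketed part, so the greedy pattern is sufficient there.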

Edit: this version uses an OrderedDict to remove duplicates while preserving order (which a set would not do):

import re
from collections import OrderedDict

# Keys are the bracketed strings; an OrderedDict keeps first-seen order.
uniq = OrderedDict()

with open('gash.txt') as inf:
    for line in inf:
        m = re.search(r'(\[.+\])', line)
        if m:
            uniq[m.groups()[0]] = None

with open('out5.txt', 'w') as outf:
    print("\n".join(uniq.keys()), file=outf)

This gives, in out5.txt:

[Aspergillus aculeatinus CBS 121060]
[gamma proteobacterium symbiont of Ctena orbiculata]
[Aspergillus vadensis CBS 113365]
[Aspergillus neoniger CBS 115656]
[Aspergillus sclerotioniger CBS 115572]
[Aspergillus eucalypticola CBS 122712]
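Since Python 3.7 a plain dict is guaranteed to keep insertion order, so the same de-duplication can probably be done without OrderedDict. The following is a sketch under that assumption; the helper name `unique_bracketed` and the in-memory `lines` list (standing in for the input file) are made up for illustration.

```python
import re

# Hypothetical in-memory stand-in for the lines of the input file.
lines = [
    "RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]",
    "PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]",
    "PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]",
    "PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]",
]

def unique_bracketed(lines):
    """Return each bracketed string once, in order of first appearance."""
    matches = (re.search(r'\[.+\]', line) for line in lines)
    # dict.fromkeys drops repeated keys but keeps the first-seen order.
    return list(dict.fromkeys(m.group(0) for m in matches if m))

result = unique_bracketed(lines)
print("\n".join(result))
```

On Python versions older than 3.7 the OrderedDict version remains the safer choice.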

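This should do exactly what you asked:

with open('out3.txt') as infile, open('out5.txt', 'w') as outfile:
    for line in infile:
        start = line.find('[')
        end = line.find(']', start)
        if start == -1 or end == -1:
            continue  # skip lines without a bracketed part
        # Keep only the text between the brackets.
        outfile.write(line[start + 1:end] + '\n')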

out3.txt

RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]
PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]
PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656]
PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572]
PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]

out5.txt

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712

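Edit

If you only want to write out the unique lines, you can update the code like this:

with open('out3.txt') as infile, open('out5.txt', 'w') as outfile:
    unique = set()  # bracketed texts already written

    for line in infile:
        start = line.find('[')
        end = line.find(']', start)
        if start == -1 or end == -1:
            continue  # skip lines without a bracketed part
        text = line[start + 1:end]

        # Write each text only the first time it is seen.
        if text not in unique:
            unique.add(text)
            outfile.write(text + '\n')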

Then you will get output like this (out5.txt):

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712

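You want to use a regular expression in your program. Regular expressions are very useful for extracting text. For example:

   import re

   s = "alphaCustomer bla bla bla [dataFindMe] bla bla bla"
   # \[ and \] match literal brackets; (.+?) captures the text between them.
   m = re.search(r"\[(.+?)\]", s)
   print(m.group(1))

Output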

   dataFindMe
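re.search returns only the first match. If several records can end up on the same line, re.findall may be a better fit, since it returns every match. The sketch below assumes such a single-line input; the shortened `text` sample and the variable names are illustrative only.

```python
import re

# Hypothetical input where several records share one line.
text = (
    "RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060] "
    "PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] "
    "PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]"
)

# With one capture group, findall returns a list of the captured texts.
# [^\]]+ stops at the first closing bracket, so matches do not run together.
names = re.findall(r"\[([^\]]+)\]", text)

for name in names:
    print(name)
```

This prints one bracketed name per line, without the brackets.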
