如何在Python中检查列表元素是否包含正则表达式?

2024-10-02 12:33:06 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个关于序列的文件。每个序列都有一些行。序列由五条白线隔开。我想把这个文件改成一个列表,然后按5行换行。所以我有一个列表,每个序列都是一个元素。然后我想删除不包含正则表达式的序列。最后,我想要一个列表,只包含包含正则表达式的序列。你知道吗

现在我有了这个。有人能帮我进一步吗?你知道吗

import re
def main():
    ReadFile()
    file = open ("filename.txt", "r")
    CreateList(file, data)
    RegEx(file, data)

def ReadFile()
    try:
        file = open ("filename.txt", "r")
    except IOError:
        print ("Can't open the file")
    except:
        print ("Something went wrong.")

def CreateList(file, data)
    data = file.readlines()
    data = data.split('\n\n\n\n\n')

def RegEx(file, data)
    regex = ("[AG].{4}GK[ST]") 
    for element in data:
        if regex not in element: 
            data.remove(element) 
    print (data) 

main()

文件看起来像:

Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) :  [occurs frequently]
   Pattern: [AG]-x(4)-G-K-[ST]
   Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371


>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT


>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT


>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS',


>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS

数据应该是(但仅包含RegEx的蛋白质):

['>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT',


'>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT',


'>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS',


'>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS']

Tags: datadef序列spfileaaproteinisolate
1条回答
网友
1楼 · 发布于 2024-10-02 12:33:06
import re
file = open("ploop.txt")
text = file.read()
file.close()

proteins = text.split("\n\n")[1:]
proteinsMatching = []
toWrite = "" 

for protein in proteins:
    if re.search(r"[AG].{4}GK[ST]", protein):
        proteinsMatching.append(protein)        


for protein in proteinsMatching:
    accensionCode = re.findall(r">sp\|(.{6})", protein)[0]
    organism = re.findall(r"\n.+?\[(.+?)\]", protein)[0]
    print(accensionCode, organism)
    toWrite += accensionCode + " " + organism + "\n"

f = open("results.txt", "w+")
f.write(toWrite)
f.close()

# Q6GZX2 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZX1 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZW0 Frog virus 3 (isolate Goorha) (FV-3)
# P32234 Drosophila melanogaster (Fruit fly)

更新(再次)以满足新的要求

Regex1(将文本文件拆分为蛋白质列表:)https://regex101.com/r/gU0gX5/1

Regex2(显示它们都匹配的regex)https://regex101.com/r/nZ0pD6/1

相关问题 更多 >

    热门问题