我需要从两个函数(ZoekAccesieCode+zoekorganime)中创建一个字典。函数ZoekAccesieCode返回类似于“Q6GZX2”的行和类似于“frogvirus3(isolategoorha)”的zoekorgisme。ZoekAccesieCode需要是关键,zoekorganime需要是价值。这是我的密码:
import re
file = open("ploop.txt")
text = file.read()
file.close()
def main():
hits = VindHits()
accesie = ZoekAccesieCode(hits)
organisme = ZoekOrganisme(hits, accesie)
MaakDict(accesie, organisme)
def VindHits():
eiwitten = text.split("\n\n")[1:]
eiwitHits = []
for eiwit in eiwitten:
if re.search(r"[AG].{4}GK[ST]", eiwit):
eiwitHits.append(eiwit)
return(eiwitHits)
def ZoekAccesieCode(hits):
for eiwit in hits:
accesieCode = re.findall(r">sp\|(.{6})", eiwit)[0]
return accesieCode
def ZoekOrganisme(hits, accesie):
for eiwit in hits:
organisme = re.findall(r"\n.+?\[(.+?)\]", eiwit)[0]
return organisme
def MaakDict(accesie, organisme):
main()
文件中的一些示例数据:
Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) : [occurs frequently]
Pattern: [AG]-x(4)-G-K-[ST]
Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371
>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT
>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT
>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS
>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS
>sp|P05080|194K_TRVSY (1707 aa)
Replicase large subunit. [Tobacco rattle virus (strain SYM)]
MANGNFKLSQLLNVDEMSAEQRSHFFDLMLTKPDCEIGQMMQRVVVDKVDDMIRERKTKDPVIVHEVLSQKEQNKLMEIY
PEFNIVFKDDKNMVHGFAAAERKLQALLLLDRVPALQEVDDIGGQWSFWVTRGEKRIHSCCPNLDIRDDQREISRQIFLT
AIGDQARSGKRQMSENELWMYDQFRKNIAAPNAVRCNNTYQGCTCRGFSDGKKKGAQYAIALHSLYDFKLKDLMATMVEK
KTKVVHAAMLFAPESMLVDEGPLPSVDGYYMKKNGKIYFGFEKDPSFSYIHDWEEYKKYLLGKPVSYQGNVFYFEPWQVR
GDTMLFSIYRIAGVPRRSLSSQEYYRRIYISRWENMVVVPIFDLVESTRELVKKDLFVEKQFMDKCLDYIARLSDQQLTI
SNVKSYLSSNNWVLFINGAAVKNKQSVDSRDLQLLAQTLLVKEQVARPVMRELREAILTETKPITSLTDVLGLISRKLWK
QFANKIAVGGFVGMVGTLIGFYPKKVLTWAKDTPNGPELCYENSHKTKVIVFLSVVYAIGGITLMRRDIRDGLVKKLCDM
FDIKRGAHVLDVENPCRYYEINDFFSSLYSASESGETVLPDLSEVKAKSDKLLQQKKEIADEFLSAKFSNYSGSSVRTSP
PSVVGSSRSGLGLLLEDSNVLTQARVGVSRKVDDEEIMEQFLSGLIDTEAEIDEVVSAFSAECERGETSGTKVLCKPLTP
PGFENVLPAVKPLVSKGKTVKRVDYFQVMGGERLPKRPVVSGDNSVDARREFLYYLDAERVAQNDEIMSLYRDYSRGVIR
TGGQNYPHGLGVWDVEMKNWCIRPVVTEHAYVFQPDKRMDDWSGYLEVAVWERGMLVNDFAVERMSDYVIVCDQTYLCNN
RLILDNLSALDLGPVNCSFELVDGVPGCGKSTMIVNSANPCVDVVLSTGRAATDDLIERFASKGFPCKLKRRVKTVDSFL
MHCVDGSLTGDVLHFDEALMAHAGMVYFCAQIAGAKRCICQGDQNQISFKPRVSQVDLRFSSLVGKFDIVTEKRETYRSP
ADVAAVLNKYYTGDVRTHNATANSMTVRKIVSKEQVSLKPGAQYITFLQSEKKELVNLLALRKVAAKVSTVHESQGETFK
DVVLVRTKPTDDSIARGREYLIVALSRHTQSLVYETVKEDDVSKEIRESAALTKAALARFFVTETVLXRFRSRFDVFRHH
EGPCAVPDSGTITDLEMWYDALFPGNSLRDSSLDGYLVATTDCNLRLDNVTIKSGNWKDKFAEKETFLKPVIRTAMPDKR
KTTQLESLLALQKRNQAAPDLQENVHATVLIEETMKKLKSVVYDVGKIRADPIVNRAQMERWWRNQSTAVQAKVVADVRE
LHEIDYSSYMYMIKSDVKPKTDLTPQFEYSALQTVVYHEKLINSLFGPIFKEINERKLDAMQPHFVFNTRMTSSDLNDRV
KFLNTEAAYDFVEIDMSKFDKSANRFHLQLQLEIYRLFGLDEWAAFLWEVSHTQTTVRDIQNGMMAHIWYQQKSGDADTY
NANSDRTLCALLSELPLEKAVMVTYGGDDSLIAFPRGTQFVDPCPKLATKWNFECKIFKYDVPMFCGKFLLKTSSCYEFV
PDPVKVLTKLGKKSIKDVQHLAEIYISLNDSNRALGNYMVVSKLSESVSDRYLYKGDSVHALCALWKHIKSFTALCTLFR
DENDKELNPAKVDWKKAQRAVSNFYDW
904 - 911: GvpgcGKS
>sp|P03589|1A_AMVLE (1126 aa)
Replication protein 1a. [Alfalfa mosaic virus (strain 425 / isolate Leiden)]
MNADAQSTDASLSMREPLSHASIQEMLRRVVEKQAADDTTAIGKVFSEAGRAYAQDALPSDKGEVLKISFSLDATQQNIL
RANFPGRRTVFSNSSSSSHCFAAAHRLLETDFVYRCFGNTVDSIIDLGGNFVSHMKVKRHNVHCCCPILDARDGARLTER
ILSLKSYVRKHPEIVGEADYCMDTFQKCSRRADYAFAIHSTSDLDVGELACSLDQKGVMKFICTMMVDADMLIHNEGEIP
NFNVRWEIDRKKDLIHFDFIDEPNLGYSHRFSLLKHYLTYNAVDLGHAAYRIERKQDFGGVMVIDLTYSLGFVPKMPHSN
GRSCAWYNRVKGQMVVHTVNEGYYHHSYQTAVRRKVLVDKKVLTRVTEVAFRQFRPNADAHSAIQSIATMLSSSTNHTII
GGVTLISGKPLSPDDYIPVATTIYYRVKKLYNAIPEMLSLLDKGERLSTDAVLKGSEGPMWYSGPTFLSALDKVNVPGDF
VAKALLSLPKRDLKSLFSRSATSHSERTPVRDESPIRCTDGVFYPIRMLLKCLGSDKFESVTITDPRSNTETTVDLYQSF
QKKIETVFSFILGKIDGPSPLISDPVYFQSLEDVYYAEWHQGNAIDASNYARTLLDDIRKQKEESLKAKAKEVEDAQKLN
RAILQVHAYLEAHPDGGKIEGLGLSSQFIAKIPELAIPTPKPLPEFEKNAETGEILRINPHSDAILEAIDYLKSTSANSI
ITLNKLGDHCQWTTKGLDVVWAGDDKRRAFIPKKNTWVGPTARSYPLAKYERAMSKDGYVTLRWDGEVLDANCVRSLSQY
EIVFVDQSCVFASAEAIIPSLEKALGLEAHFSVTIVDGVAGCGKTTNIKQIARSSGRDVDLILTSNRSSADELKETIDCS
PLTKLHYIRTCDSYLMSASAVKAQRLIFDECFLQHAGLVYAAATLAGCSEVIGFGDTEQIPFVSRNPSFVFRHHKLTGKV
ERKLITWRSPADATYCLEKYFYKNKKPVKTNSRVLRSIEVVPINSPVSVERNTNALYLCHTQAEKAVLKAQTHLKGCDNI
FTTHEAQGKTFDNVYFCRLTRTSTSLATGRDPINGPCNGLVALSRHKKTFKYFTIAHDSDDVIYNACRDAGNTDDSILAR
SYNHNF
838 - 845: GvagcGKT
>sp|Q9AT00|TGD3_ARATH (345 aa)
Protein TRIGALACTOSYLDIACYLGLYCEROL 3, chloroplastic. [Arabidopsis thaliana (Mouse-ear cress)]
MLSLSCSSSSSSLLPPSLHYHGSSSVQSIVVPRRSLISFRRKVSCCCIAPPQNLDNDATKFDSLTKSGGGMCKERGLEND
SDVLIECRDVYKSFGEKHILKGVSFKIRHGEAVGVIGPSGTGKSTILKIMAGLLAPDKGEVYIRGKKRAGLISDEEISGL
RIGLVFQSAALFDSLSVRENVGFLLYERSKMSENQISELVTQTLAAVGLKGVENRLPSELSGGMKKRVALARSLIFDTTK
EVIEPEVLLYDEPTAGLDPIASTVVEDLIRSVHMTDEDAVGKPGKIASYLVVTHQHSTIQRAVDRLLFLYEGKIVWQGMT
HEFTTSTNPIVQQFATGSLDGPIRY
117 - 124: GpsgtGKS
有人能帮我找到正确的密码吗?你知道吗
这段代码的前提条件是accessie中的值可以是键(它们是字符串或unicode),accessie和organime是长度相同的列表,否则必须使用切片使它们具有相同的长度
你的代码几乎不可读。你知道吗
相关问题 更多 >
编程相关推荐