Extracting words from a complex multi-paragraph document and writing them to a multi-line comma-separated file

QUERY: STBZIP38
Length of Query Sequence: 2000 bp | Nucleotide Frequencies: A - 0.34 G - 0.16 T - 0.35 C - 0.15

TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
Motifs on "+" Strand: Mean Exp. Number 0.00391 Up.Conf.Int. 1 Found 1
421 tCCACGTGGC 430 (Mism.= 1)
Motifs on "-" Strand: Mean Exp. Number 0.00391 Up.Conf.Int. 1 Found 1
430 GCCACGTGGa 421 (Mism.= 1)

TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
Motifs on "+" Strand: Mean Exp. Number 0.00358 Up.Conf.Int. 1 Found 1
422 CCACGTGGCa 431 (Mism.= 1)

TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
Motifs on "+" Strand: Mean Exp. Number 0.00358 Up.Conf.Int. 1 Found 1
422 CCACGTGGCa 431 (Mism.= 1)

Totally 50 motifs of 43 different TFBSs have been found
____________________________________________________________
QUERY: STBZIP17
Length of Query Sequence: 2000 bp | Nucleotide Frequencies: A - 0.37 G - 0.13 T - 0.39 C - 0.11

TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
Motifs on "-" Strand: Mean Exp. Number 0.00187 Up.Conf.Int. 1 Found 1
206 AATAATTAaAcATTAATTAA 187 (Mism.= 2)

TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
Motifs on "-" Strand: Mean Exp. Number 0.00440 Up.Conf.Int. 1 Found 1
1027 TAAAGAATAaAAAAAaaAA 1009 (Mism.= 3)

TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
Motifs on "-" Strand: Mean Exp. Number 0.00260 Up.Conf.Int. 1 Found 1
1966 AGAGAGAGA 1958 (Mism.= 0)

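2 Answers

User

#1 · Edited 2024-06-03 13:18:55

Here is some boilerplate to get you started. I have prepared the regex patterns; the rest is up to you. PS: what you need is the readlines() method plus regex, not split.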

import re

s = """QUERY: STBZIP38
     Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.34   G -  0.16   T -  0.35   C -  0.15
    
    
     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
     Motifs on "+" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         421  tCCACGTGGC      430 (Mism.= 1)
    
     Motifs on "-" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         430  GCCACGTGGa      421 (Mism.= 1)
    
     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
    
     TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
Totally      50 motifs of    43 different TFBSs have been found
 QUERY: STBZIP17
 Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.37   G -  0.13   T -  0.39   C -  0.11


 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
 Motifs on "-" Strand: Mean Exp. Number   0.00187     Up.Conf.Int.  1     Found   1
     206  AATAATTAaAcATTAATTAA      187 (Mism.= 2)

 TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
 Motifs on "-" Strand: Mean Exp. Number   0.00440     Up.Conf.Int.  1     Found   1
    1027  TAAAGAATAaAAAAAaaAA     1009 (Mism.= 3)

 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
 Motifs on "-" Strand: Mean Exp. Number   0.00260     Up.Conf.Int.  1     Found   1
    1966  AGAGAGAGA     1958 (Mism.= 0)"""
         
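# Raw strings keep the backslash in \d from being read as a string escape.
pat1 = r'STB.*\d*'  # query name: matches from "STB" to the end of the line
pat2 = r'RSP.*OS'   # accession: matches from "RSP" through the last "OS" on the line

m = re.findall(pat1, s)
n = re.findall(pat2, s)

# print(m, n)

# The indices are hard-coded for this sample: three accessions per query.
print(m[0], n[0])
print(m[0], n[1])
print(m[0], n[2])
print(m[1], n[3])
print(m[1], n[4])
print(m[1], n[5])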

Output

STBZIP38 RSP00073//OS
STBZIP38 RSP00153//OS
STBZIP38 RSP00154//OS
STBZIP17 RSP00577//OS
STBZIP17 RSP00797//OS
STBZIP17 RSP00864//OS
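
The hard-coded indices above only fit this sample. Below is a minimal sketch of one way to pair each accession with its query automatically and emit comma-separated lines. It assumes every query name follows `QUERY:` and every accession follows `TFBS AC:` and ends before `//`. The names `TOKEN` and `pair_queries` and the shortened sample are illustrative, not part of the original answer.

```python
import re

# One pattern with two alternatives: a QUERY header or a TFBS accession.
TOKEN = re.compile(r'QUERY:\s*(\S+)|TFBS AC:\s*(\w+)')

def pair_queries(text):
    """Return 'query,accession' strings, pairing each accession with the latest query."""
    rows = []
    query = None
    for match in TOKEN.finditer(text):
        if match.group(1):          # a QUERY header: remember its name
            query = match.group(1)
        elif query is not None:     # an accession under a known query
            rows.append(f"{query},{match.group(2)}")
    return rows

# Shortened sample taken from the report in the question.
sample = """QUERY: STBZIP38
 TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
 TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
 QUERY: STBZIP17
 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
"""

result = pair_queries(sample)
print("\n".join(result))
```

Because `\w+` stops at the first `/`, the `//OS` suffix is dropped, and the plain `/TFBS:` fields are ignored since the pattern requires the literal `TFBS AC:`.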

User

#2 · Edited 2024-06-03 13:18:55

OK, folks. Here is how it ended up. Many thanks to CYREX and xelf (Reddit) for the help.

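with open('Softberry.txt') as f:
    for line in f:
        # lstrip() so the header is found with or without leading spaces
        if line.lstrip().startswith('QUERY:'):
            query = line.split(':', 1)[1].strip()
        if 'AC:' in line:
            # text between 'AC:' and the next ':', e.g. RSP00073//OS
            ac = line.split('AC:')[1].split(':')[0].strip()
            print(query, ac)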
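
To get the comma-separated file the question asks for, the same loop can write rows instead of printing them. This is a minimal sketch, assuming the accession should be cut at `//` and that each accession line starts with `TFBS AC:`. The function name `report_to_csv` is hypothetical, and the in-memory buffer stands in for a real output file.

```python
import csv
import io

def report_to_csv(lines, out_file):
    """Write one 'query,accession' row per TFBS AC line to out_file."""
    writer = csv.writer(out_file, lineterminator="\n")
    query = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("QUERY:"):
            query = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("TFBS AC:") and query is not None:
            # Keep only the accession, dropping everything from '//' onward.
            ac = stripped.split("AC:", 1)[1].split("//", 1)[0].strip()
            writer.writerow([query, ac])

# Demo on two lines from the report, written to an in-memory buffer.
demo = io.StringIO()
report_to_csv(
    [
        " QUERY: STBZIP17\n",
        " TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1\n",
    ],
    demo,
)
print(demo.getvalue(), end="")
```

With a real report, pass open file objects instead: the input opened with `open('Softberry.txt')` and the output opened with `newline=''`, as the `csv` module documentation recommends for files.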
