Extracting words from a complex multi-paragraph document and writing them out as a multi-line comma-separated file

Posted 2024-06-03 13:18:55


I have a file in the following format:

    QUERY: STBZIP38
     Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.34   G -  0.16   T -  0.35   C -  0.15
    
    
     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
     Motifs on "+" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         421  tCCACGTGGC      430 (Mism.= 1)
    
     Motifs on "-" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         430  GCCACGTGGa      421 (Mism.= 1)
    
     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
    
     TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
Totally      50 motifs of    43 different TFBSs have been found
____________________________________________________________

 QUERY: STBZIP17
 Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.37   G -  0.13   T -  0.39   C -  0.11


 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
 Motifs on "-" Strand: Mean Exp. Number   0.00187     Up.Conf.Int.  1     Found   1
     206  AATAATTAaAcATTAATTAA      187 (Mism.= 2)

 TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
 Motifs on "-" Strand: Mean Exp. Number   0.00440     Up.Conf.Int.  1     Found   1
    1027  TAAAGAATAaAAAAAaaAA     1009 (Mism.= 3)

 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
 Motifs on "-" Strand: Mean Exp. Number   0.00260     Up.Conf.Int.  1     Found   1
    1966  AGAGAGAGA     1958 (Mism.= 0)

The output I want looks like this:

STBZIP38    RSP00073//OS
STBZIP38    RSP00153//OS
STBZIP38    RSP00154//OS
STBZIP17    RSP00577//OS
STBZIP17    RSP00797//OS
STBZIP17    RSP00864//OS

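I have been going through some tutorials and tried using the split function (I am still learning the ABCs of Python). I started with the code below. What I am still trying to figure out is how to grab only the word that follows a given term (for example, find QUERY: and grab only STBZIP38, then grab the ID that follows TFBS AC:). I would really appreciate any help with this. Thanks in advance.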

with open('Softberry.txt') as fo:
    for rec in fo:
        print((rec.split('QUERY:')) + ',' + (rec.split('TFBS AC:')))
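
A note on why this attempt fails: `str.split` returns a list, so adding the string `','` to its result raises a `TypeError`. The sketch below shows one way to take only the token that follows a marker. The helper name `text_after` is hypothetical and not from the original post, and the two sample lines are copied from the file excerpt above.

```python
def text_after(line, marker):
    """Return the first token after marker (trailing colon removed), or None if marker is absent."""
    if marker not in line:
        return None
    # split() returns a list; index 1 is everything after the first marker
    rest = line.split(marker, 1)[1]
    tokens = rest.split()
    # Drop the colon that ends the accession field, e.g. 'RSP00073//OS:'
    return tokens[0].rstrip(":") if tokens else None


line_q = " QUERY: STBZIP38"
line_t = (" TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) "
          "/GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1")

print(text_after(line_q, "QUERY:"))
print(text_after(line_t, "TFBS AC:"))
```

Lines that do not contain the marker return `None`, so the caller can skip them instead of printing empty fields.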

2 answers

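Here is some boilerplate to get you started; I have prepared the exact regex patterns, which do the rest. PS: what you need is the readlines() method plus regex, not split.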

import re

s = """QUERY: STBZIP38
     Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.34   G -  0.16   T -  0.35   C -  0.15
    
    
     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
     Motifs on "+" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         421  tCCACGTGGC      430 (Mism.= 1)
    
     Motifs on "-" Strand: Mean Exp. Number   0.00391     Up.Conf.Int.  1     Found   1
         430  GCCACGTGGa      421 (Mism.= 1)
    
     TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
    
     TFBS AC: RSP00154//OS: parsley (Petroselinum crispum) /GENE: CHS/TFBS: ACE (CHS) /BF: bZIP factors CPRF1, CPRF4
     Motifs on "+" Strand: Mean Exp. Number   0.00358     Up.Conf.Int.  1     Found   1
         422  CCACGTGGCa      431 (Mism.= 1)
Totally      50 motifs of    43 different TFBSs have been found
 QUERY: STBZIP17
 Length of Query Sequence:       2000 bp     | Nucleotide Frequencies:  A -  0.37   G -  0.13   T -  0.39   C -  0.11


 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
 Motifs on "-" Strand: Mean Exp. Number   0.00187     Up.Conf.Int.  1     Found   1
     206  AATAATTAaAcATTAATTAA      187 (Mism.= 2)

 TFBS AC: RSP00797//OS: potato (Solanum tuberosum) /GENE: patatin 21/TFBS: SURE-1 /BF: SURF
 Motifs on "-" Strand: Mean Exp. Number   0.00440     Up.Conf.Int.  1     Found   1
    1027  TAAAGAATAaAAAAAaaAA     1009 (Mism.= 3)

 TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1
 Motifs on "-" Strand: Mean Exp. Number   0.00260     Up.Conf.Int.  1     Found   1
    1966  AGAGAGAGA     1958 (Mism.= 0)"""
         
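# Raw strings keep the regex backslash in \d from being treated as a string escape.
pat1 = r'STB.*\d*'   # query names, e.g. STBZIP38
pat2 = r'RSP.*OS'    # accessions up to the OS field, e.g. RSP00073//OS

m = re.findall(pat1, s)
n = re.findall(pat2, s)

# print(m, n)

# The indices are hard-coded for this sample: the first three accessions
# belong to the first query, the last three to the second.
print(m[0], n[0])
print(m[0], n[1])
print(m[0], n[2])
print(m[1], n[3])
print(m[1], n[4])
print(m[1], n[5])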

Output

STBZIP38 RSP00073//OS
STBZIP38 RSP00153//OS
STBZIP38 RSP00154//OS
STBZIP17 RSP00577//OS
STBZIP17 RSP00797//OS
STBZIP17 RSP00864//OS
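
The hard-coded indices only fit this particular sample. As a hedged alternative, one regex with two alternatives can be scanned in order so that each accession is paired with the most recent query. This is a sketch, not the answerer's method: the pattern, the function name `pair_queries`, and the shortened `sample` text are assumptions made for illustration.

```python
import re

# Shortened sample in the same layout as the report (assumed for illustration)
sample = """ QUERY: STBZIP38
 TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1
 TFBS AC: RSP00153//OS: Parsley, Petroselinum crispum /GENE: CHS/TFBS: Box II /BF: CPRF-1; CPRF-2; CPRF-3;
 QUERY: STBZIP17
 TFBS AC: RSP00577//OS: tomato (Lycopersicon esculentum), Lycopersicon esculentum /GENE: rbcS3A/TFBS: AT-rich FF2 /BF: unknown nuclear factor
"""

# Group 1 captures a query name; group 2 captures an accession up to its closing colon.
pattern = re.compile(r"QUERY:\s*(\S+)|TFBS AC:\s*(\S+?):")


def pair_queries(text):
    """Return (query, accession) tuples in the order they appear in text."""
    pairs = []
    current = None
    for match in pattern.finditer(text):
        query, accession = match.groups()
        if query is not None:
            current = query          # a new QUERY block starts here
        elif current is not None:
            pairs.append((current, accession))
    return pairs


for query, accession in pair_queries(sample):
    print(query, accession)
```

Because the matches are visited in document order, the number of TFBS entries per query does not need to be known in advance.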


    

OK, folks. Here is how it ended up. Many thanks to CYREX and xelf (Reddit) for their help.

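with open('Softberry.txt') as f:
    for line in f:
        # Remember the most recent query name (assumes exactly one leading space).
        if line.startswith(' QUERY:'):
            query = line.split(':', 1)[1].strip()
        # Each TFBS line yields the text between 'AC:' and the next colon.
        if 'AC:' in line:
            ac = line.split('AC:')[1].split(':')[0].strip()
            print(query, ac)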
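
Since the goal in the title is a comma-separated file, here is a minimal sketch of the same loop that writes CSV rows with the standard `csv` module. It strips each line before testing it, so the amount of leading whitespace does not matter. The function names `extract_pairs` and `write_pairs` and the in-memory demo text are assumptions for illustration, not part of the original solution.

```python
import csv
import io


def extract_pairs(lines):
    """Yield (query, accession) pairs from an iterable of report lines."""
    query = None
    for line in lines:
        stripped = line.strip()  # tolerate any amount of leading whitespace
        if stripped.startswith("QUERY:"):
            query = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("TFBS AC:") and query is not None:
            # Text between 'AC:' and the next colon, e.g. 'RSP00073//OS'
            yield query, stripped.split("AC:", 1)[1].split(":", 1)[0].strip()


def write_pairs(lines, out):
    """Write one 'query,accession' row per pair to the file-like object out."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(extract_pairs(lines))


# Demo on in-memory text so the sketch runs without Softberry.txt
demo = io.StringIO(
    "    QUERY: STBZIP38\n"
    "     TFBS AC: RSP00073//OS: tobacco (Nicotiana tabacum) /GENE: synthetic oligonucleotides/TFBS: PA /BF: TAF-1\n"
    " QUERY: STBZIP17\n"
    " TFBS AC: RSP00864//OS: arabidopsis (Arabidopsis thaliana) /GENE: STK/TFBS: GA-5 /BF: BPC1\n"
)
buffer = io.StringIO()
write_pairs(demo, buffer)
print(buffer.getvalue(), end="")
```

For a real run, the two `StringIO` objects would be replaced by open files, for example the input opened with `open('Softberry.txt')` and an output file of your choosing opened with `newline=''`, as the `csv` module documentation recommends.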
