<p>我经常使用Python解析格式化的文本文件(用于生物学研究,但是我会尝试用一种你不需要生物学背景的方式来问我的问题。)我处理的是一种称为pdb文件的文件,它包含格式化文本中蛋白质的三维结构。这是一个例子:</p>
<pre><code>HEADER CHROMOSOMAL PROTEIN 02-JAN-87 1UBQ
TITLE STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
REMARK 1
REMARK 1 REFERENCE 1
REMARK 1 AUTH S.VIJAY-KUMAR,C.E.BUGG,K.D.WILKINSON,R.D.VIERSTRA,
REMARK 1 AUTH 2 P.M.HATFIELD,W.J.COOK
REMARK 1 TITL COMPARISON OF THE THREE-DIMENSIONAL STRUCTURES OF HUMAN,
REMARK 1 TITL 2 YEAST, AND OAT UBIQUITIN
REMARK 1 REF J.BIOL.CHEM. V. 262 6396 1987
REMARK 1 REFN ISSN 0021-9258
ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N
ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C
ATOM 3 C MET A 1 26.913 26.639 3.531 1.00 9.62 C
ATOM 4 O MET A 1 27.886 26.463 4.263 1.00 9.62 O
ATOM 5 CB MET A 1 25.112 24.880 3.649 1.00 13.77 C
ATOM 6 CG MET A 1 25.353 24.860 5.134 1.00 16.29 C
ATOM 7 SD MET A 1 23.930 23.959 5.904 1.00 17.17 S
ATOM 8 CE MET A 1 24.447 23.984 7.620 1.00 16.11 C
ATOM 9 N GLN A 2 26.335 27.770 3.258 1.00 9.27 N
ATOM 10 CA GLN A 2 26.850 29.021 3.898 1.00 9.07 C
ATOM 11 C GLN A 2 26.100 29.253 5.202 1.00 8.72 C
ATOM 12 O GLN A 2 24.865 29.024 5.330 1.00 8.22 O
ATOM 13 CB GLN A 2 26.733 30.148 2.905 1.00 14.46 C
ATOM 14 CG GLN A 2 26.882 31.546 3.409 1.00 17.01 C
ATOM 15 CD GLN A 2 26.786 32.562 2.270 1.00 20.10 C
ATOM 16 OE1 GLN A 2 27.783 33.160 1.870 1.00 21.89 O
ATOM 17 NE2 GLN A 2 25.562 32.733 1.806 1.00 19.49 N
ATOM 18 N ILE A 3 26.849 29.656 6.217 1.00 5.87 N
ATOM 19 CA ILE A 3 26.235 30.058 7.497 1.00 5.07 C
ATOM 20 C ILE A 3 26.882 31.428 7.862 1.00 4.01 C
ATOM 21 O ILE A 3 27.906 31.711 7.264 1.00 4.61 O
ATOM 22 CB ILE A 3 26.344 29.050 8.645 1.00 6.55 C
ATOM 23 CG1 ILE A 3 27.810 28.748 8.999 1.00 4.72 C
ATOM 24 CG2 ILE A 3 25.491 27.771 8.287 1.00 5.58 C
ATOM 25 CD1 ILE A 3 27.967 28.087 10.417 1.00 10.83 C
TER 26 ILE A 3
HETATM 604 O HOH A 77 45.747 30.081 19.708 1.00 12.43 O
HETATM 605 O HOH A 78 19.168 31.868 17.050 1.00 12.65 O
HETATM 606 O HOH A 79 32.010 38.387 19.636 1.00 12.83 O
HETATM 607 O HOH A 80 42.084 27.361 21.953 1.00 22.27 O
END
</code></pre>
<p><code>ATOM</code>标记包含原子坐标的行的开头。<code>TER</code>表示坐标的结束。我想获取包含原子坐标的整个文本块,因此我使用:</p>
^{pr2}$
<p>但它什么都不匹配。必须有一种使用正则表达式获取整个文本块的方法。我也不明白为什么这个regex模式不起作用。感谢帮助。在</p>