Extracting and processing the information between two strings that repeat many times throughout a file

Posted 2024-05-20 00:00:28


I have a file with the following structure:

 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   122.771603 - DENSITY  2.704 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32540491     6.32540491     6.32540491    46.774144  46.774144  46.774144
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.912600492192E-01 -8.739950780750E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
      7 F   8 O    -8.739950780750E-03  2.500000000000E-01 -4.912600492193E-01
      8 F   8 O     4.912600492193E-01  8.739950780750E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.912600492193E-01  8.739950780750E-03
     10 F   8 O     8.739950780750E-03 -2.500000000000E-01  4.912600492193E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)
         A              B              C           ALPHA      BETA       GAMMA
     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    0.000000000000E+00  0.000000000000E+00 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
      7 F   8 O     7.459338255258E-02  4.079267158859E-01 -8.333333333333E-02
      8 F   8 O     4.079267158859E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.459338255258E-02  8.333333333333E-02
     10 F   8 O    -7.459338255258E-02 -4.079267158859E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT


more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   119.823364 - DENSITY  2.770 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.28373604     6.28373604     6.28373604    46.646397  46.646397  46.646397
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.924094276183E-01 -7.590572381674E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
      7 F   8 O    -7.590572381674E-03  2.500000000000E-01 -4.924094276183E-01
      8 F   8 O     4.924094276183E-01  7.590572381674E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.924094276183E-01  7.590572381674E-03
     10 F   8 O     7.590572381674E-03 -2.500000000000E-01  4.924094276183E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
      7 F   8 O     7.574276095166E-02  4.090760942850E-01 -8.333333333333E-02
      8 F   8 O     4.090760942850E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.574276095166E-02  8.333333333333E-02
     10 F   8 O    -7.574276095166E-02 -4.090760942850E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
 *******************************************************************************
 LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
 PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME=   121.143469 - DENSITY  2.740 g/cm^3
         A              B              C           ALPHA      BETA       GAMMA
     6.32229536     6.32229536     6.32229536    46.436583  46.436583  46.436583
 *******************************************************************************
 ATOMS IN THE ASYMMETRIC UNIT    3 - ATOMS IN THE UNIT CELL:   10
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA    5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
      3 T   6 C     2.500000000000E-01  2.500000000000E-01  2.500000000000E-01
      4 F   6 C    -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
      5 T   8 O    -4.927088991116E-01 -7.291100888437E-03  2.500000000000E-01
      6 F   8 O     2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
      7 F   8 O    -7.291100888437E-03  2.500000000000E-01 -4.927088991116E-01
      8 F   8 O     4.927088991116E-01  7.291100888437E-03 -2.500000000000E-01
      9 F   8 O    -2.500000000000E-01  4.927088991116E-01  7.291100888437E-03
     10 F   8 O     7.291100888437E-03 -2.500000000000E-01  4.927088991116E-01

 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
  1.0000  0.0000  1.0000 -1.0000  1.0000  1.0000  0.0000 -1.0000  1.0000

 *******************************************************************************
 CRYSTALLOGRAPHIC CELL (VOLUME=        363.43040599)
         A              B              C           ALPHA      BETA       GAMMA
     4.98494429     4.98494429    16.88768068    90.000000  90.000000 120.000000

 COORDINATES IN THE CRYSTALLOGRAPHIC CELL
     ATOM                 X/A                 Y/B                 Z/C
 *******************************************************************************
      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00
      2 F  20 CA   -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02
      5 T   8 O    -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
      6 F   8 O     3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
      7 F   8 O     7.604223244490E-02  4.093755657782E-01 -8.333333333333E-02
      8 F   8 O     4.093755657782E-01  3.333333333333E-01  8.333333333333E-02
      9 F   8 O    -3.333333333333E-01  7.604223244490E-02  8.333333333333E-02
     10 F   8 O    -7.604223244490E-02 -4.093755657782E-01  8.333333333333E-02

 T = ATOM BELONGING TO THE ASYMMETRIC UNIT
 INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE

more lines
more lines
more lines

I want to extract the CRYSTALLOGRAPHIC CELL information, but only from the FINAL OPTIMIZED GEOMETRY blocks.

The following three patterns:

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

make it possible to locate the information.
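As a quick check of how these patterns behave, the sketch below applies them with re.match to a few lines copied from the file above. The names classify and sample_lines are made up for this illustration and are not part of the program.

```python
import re

initial_pattern = r'^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = r'^ CRYSTALLOGRAPHIC CELL '
end_pattern = r'^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

def classify(line):
    """Return the name of the first pattern that matches, or None."""
    for name, pattern in (('initial', initial_pattern),
                          ('middle', middle_pattern),
                          ('end', end_pattern)):
        # re.match anchors at the start of the string; '$' also matches
        # just before a trailing newline, so lines need no stripping.
        if re.match(pattern, line):
            return name
    return None

# Hypothetical sample: four lines taken from the file shown above.
sample_lines = [
    ' FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3\n',
    ' CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)\n',
    ' COORDINATES IN THE CRYSTALLOGRAPHIC CELL\n',
    ' T = ATOM BELONGING TO THE ASYMMETRIC UNIT\n',
]
labels = [classify(line) for line in sample_lines]
print(labels)
```

Note that middle_pattern on its own matches every CRYSTALLOGRAPHIC CELL header; it carries no information about which block the header belongs to.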

First, I define a flag, passed_mid_point = False.

Then the following part of the program extracts the VOLUME for each FINAL OPTIMIZED GEOMETRY block (it is read from the PRIMITIVE CELL line):

VOLUMES = []
with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)
print 'VOLUMES = ', VOLUMES

This is correct, since VOLUMES = ['119.823364', '121.143469']. Note that the initial 122.771603 (see the original file) is, as expected, not collected into the list.

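The problem appears when I extract A and C (P0 and P2 in my program), the parameters of the CRYSTALLOGRAPHIC CELL in each FINAL OPTIMIZED GEOMETRY block, together with the coordinates: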

        if re.match(middle_pattern, line):
            passed_mid_point = True

            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

        if re.match(end_pattern, line):
            passed_mid_point = False

        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
#           print 'terms[1] =', terms[1]

            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)

                x = terms[4]
                print 'x =', x
                Xs.append(x)

                y = terms[5]
                print 'y = ', y
                Ys.append(y)

                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

The result is:

P0 =  ['5.02162261', '4.97568007', '4.98494429']

This is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY block (see the file).

The coordinates are wrong too:

Xs =  ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys =  ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs =  ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS =  ['20', '6', '8', '20', '6', '8', '20', '6', '8']

This would be the desired result:

VOLUMES =  ['119.823364', '121.143469']
P0 = ['4.97568007', '4.98494429']
P2 = ['16.76591397', '16.88768068']
Xs =  ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
Ys =  ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
Zs =  ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
ATOMIC_NUMBERS =  ['20', '6', '8', '20', '6', '8']

I would be very grateful for any help.

The whole code:

import sys
import re
import os

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3$'
middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as file:
    passed_mid_point = False
    for line in file:
        if re.match(initial_pattern, line):
            passed_mid_point = False
            print file.next()
            print file.next()
            print file.next()

            volume_line = file.next()
            print volume_line
            aux = volume_line.split()
            each_volume = aux[7]
            print each_volume
            VOLUMES.append(each_volume)

        if re.match(middle_pattern, line):
            passed_mid_point = True

            print line

            print file.next()
            parameters_line = file.next()
            aux = parameters_line.split()
            p0 = aux[0]
            p1 = aux[1]
            p2 = aux[2]
            p3 = aux[3]
            p4 = aux[4]
            p5 = aux[5] # 

            print p0
            print p2

            P0.append(p0)
            P2.append(p2)

            print file.next()
            print file.next()
            print file.next()
            print file.next()

        if re.match(end_pattern, line):
            passed_mid_point = False

        elif passed_mid_point:
            # parse the coordinates
            print 'line2 =', line
            terms = line.split()
            print 'terms =', terms
#           print 'terms[1] =', terms[1]

            if terms and terms[1] == 'T':
                print terms[1]
                atomic_number = terms[2]
                print 'atomic_number = ', atomic_number
                ATOMIC_NUMBERS.append(atomic_number)

                x = terms[4]
                print 'x =', x
                Xs.append(x)

                y = terms[5]
                print 'y = ', y
                Ys.append(y)

                z = terms[6]
                print 'z = ', z
                Zs.append(z)

print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

2 answers

I wrote a simplified version of your script, and it seems to work. I hope it can serve as a starting point for your final script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open('g.out') as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL' in line:
            if not final_optimized_geometry:
                continue
            volume = line.split()[7]
            VOLUMES.append(volume)
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            line = gout.readline()
            p0, p2 = line.split()[0:3:2]

            P0.append(p0)
            P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if not final_optimized_geometry:
                continue
            gout.readline()
            gout.readline()
            while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                line = gout.readline()
                atomdata = line.split()
                if not atomdata or atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)
            final_optimized_geometry = False


print(VOLUMES)
print(P0)
print(P2)
print(ATOMIC_NUMBERS)
print(Xs)
print(Ys)
print(Zs)

This produces the following output:

['119.823364', '121.143469']
['4.97568007', '4.98494429']
['16.76591397', '16.88768068']
['20', '6', '8', '20', '6', '8']
['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
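All collected values are strings. If numbers are needed for further processing, the built-in float() parses this E notation directly. A minimal sketch, where Xs is a shortened copy of the list printed above and xs is a name introduced here for illustration:

```python
# Shortened copy of the Xs list from the output above (strings).
Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01']

# float() accepts the 'E+00' / 'E-01' exponent form as written in the file.
xs = [float(x) for x in Xs]
print(xs)
```

The same conversion can be applied to VOLUMES, P0 and P2; ATOMIC_NUMBERS would more naturally go through int().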

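In effect, it is a very simple finite state machine with only two states. Caveat: if a single final optimized geometry contains more than one crystallographic cell, it will not work; in that case it only captures the information of the first cell.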
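To make the two states explicit, here is a reduced sketch of the same idea that collects only A and C. It is a simplification, not the script above: the state names OUTSIDE and INSIDE, the function crystallographic_a_c and the sample text are assumptions for illustration, and the state is reset right after the cell parameters rather than after the coordinates.

```python
import io

OUTSIDE, INSIDE = 'outside', 'inside'   # the two states

def crystallographic_a_c(lines):
    """Collect (A, C) of the crystallographic cell from final geometries only."""
    state = OUTSIDE
    lines = iter(lines)
    result = []
    for line in lines:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            state = INSIDE                      # OUTSIDE -> INSIDE
        elif state == INSIDE and 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            next(lines)                         # skip the A B C ALPHA ... header
            values = next(lines).split()        # the row holding the parameters
            result.append((values[0], values[2]))
            state = OUTSIDE                     # INSIDE -> OUTSIDE
    return result

# Hypothetical excerpt: one initial cell, then one final optimized geometry.
sample = u"""\
 CRYSTALLOGRAPHIC CELL (VOLUME=        368.31480902)
         A              B              C           ALPHA      BETA       GAMMA
     5.02162261     5.02162261    16.86554607    90.000000  90.000000 120.000000
 FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM      3
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000
"""

cells = crystallographic_a_c(io.StringIO(sample))
print(cells)
```

The first cell is ignored because the machine is still in the OUTSIDE state when its header is read.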

The code also makes other assumptions about the file, which of course may need to be verified.

I avoided regular expressions.

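This code only runs in Python 3 (tested with Python 3.6.2). Python 2.7 complains about using readline() inside a file-iteration block (which makes sense, but it is nice to see that Python 3 can handle it). We use readline() as a small trick to skip lines in the input file that we know must be skipped, without going through the whole loop again (which would require more flag variables).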
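One way to get the same line-skipping effect without mixing iteration and readline() is the built-in next(), which exists in both Python 2.6+ and Python 3 and advances the same iterator the for loop is using. A minimal sketch, with io.StringIO standing in for the real file; the skip helper and the sample text are hypothetical:

```python
import io

def skip(iterator, n):
    """Advance the iterator by n lines, discarding them."""
    for _ in range(n):
        next(iterator)

# Hypothetical excerpt of the file, used instead of open('g.out').
text = u"""\
 CRYSTALLOGRAPHIC CELL (VOLUME=        359.47009054)
         A              B              C           ALPHA      BETA       GAMMA
     4.97568007     4.97568007    16.76591397    90.000000  90.000000 120.000000
"""

params = None
gout = io.StringIO(text)
for line in gout:
    if 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
        skip(gout, 1)                  # skip the column-header line
        params = next(gout).split()    # the row holding A, B, C, ALPHA, ...

print(params[0], params[2])
```

If the file ends before the expected line, next() raises StopIteration, so a truncated file fails loudly instead of looping forever.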

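By the way, if your only task is parsing text files, it might be interesting to look at a dedicated language (e.g. Lex). Also, Perl was designed for this kind of work, more so than Python.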

Hope this helps!

Thanks to all the help from @Bart Van Loon, a simpler version of the code is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

global N_atom_irreducible_unit
N_atom_irreducible_unit = 3

filename = 'g.out'

VOLUMES = []
P0 = []
P2 = []
ATOMIC_NUMBERS = []
Xs = []
Ys = []
Zs = []

with open(filename) as gout:
    final_optimized_geometry = False
    for line in gout:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            final_optimized_geometry = True
        elif 'PRIMITIVE CELL - CENTRING CODE' in line:
            if final_optimized_geometry:
                volume = line.split()
                print volume
                print volume[7]
                volume = line.split()[7]
                VOLUMES.append(volume)

        elif ' CRYSTALLOGRAPHIC CELL (V' in line:
            if final_optimized_geometry:
                print 'gout.next() =', gout.next()
                done = gout.next()
                print 'done =', done
                p0 = done.split()[0]
                p2 = done.split()[2]

#               p0, p2 = done.split()[0:3:2]

                P0.append(p0)
                P2.append(p2)
        elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
            if final_optimized_geometry:
                gout.next()
                gout.next()
                while True:
                    line = gout.next()
                    atomdata = line.split()
                    if not atomdata:
                        break
                    if atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False



print 'VOLUMES = ', VOLUMES
print 'P0 = ', P0
print 'P2 = ', P2
print 'Xs = ', Xs
print 'Ys = ', Ys
print 'Zs = ', Zs
print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

where:

1) Because the line following the last atom (the 10th atom in this example) is blank,

                    if not atomdata:
                        break

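always stops when atomdata is empty. In other words, it always stops at a blank line, i.e. at the end of the atom list. This makes it possible to avoid the while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line: statement.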

A similar statement would be:

                    if  atomdata:   
                        continue

However, for some reason I do not understand, this does not restrict the parsing to the non-empty lines only. Why?
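A likely explanation, assuming the rest of the loop body is unchanged: continue jumps straight to the next iteration, so "if atomdata: continue" skips every non-empty line (the atom rows) and only lets the blank line through, where atomdata[1] would then fail with an IndexError. The two variants are not opposites in effect, because break leaves the loop while continue stays in it. The sketch below shows which lines each variant actually reaches; rows, with_break and with_continue are names invented for this illustration.

```python
# Two atom rows followed by the blank line that ends the list.
rows = [
    '      1 T  20 CA    0.000000000000E+00  0.000000000000E+00  0.000000000000E+00',
    '      2 F  20 CA    0.000000000000E+00  0.000000000000E+00 -5.000000000000E-01',
    '',
]

def with_break(lines):
    """Record the first field of every line that gets parsed."""
    seen = []
    for line in lines:
        atomdata = line.split()
        if not atomdata:
            break                # blank line: leave the loop
        seen.append(atomdata[0])
    return seen

def with_continue(lines):
    """Record what reaches the parsing step when non-empty lines are skipped."""
    seen = []
    for line in lines:
        atomdata = line.split()
        if atomdata:
            continue             # non-empty line: skipped, never parsed
        seen.append(atomdata)    # only reached for the blank line
    return seen

print(with_break(rows))      # the two atom rows are parsed
print(with_continue(rows))   # only the empty list from the blank line
```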

2) This part of the code:

                if atomdata[1] != 'T':
                    continue
                atomicnumber = atomdata[2]
                x, y, z = atomdata[4:7]
                ATOMIC_NUMBERS.append(atomicnumber)
                Xs.append(x)
                Ys.append(y)
                Zs.append(z)

can also be written as:

              if atomdata[1] == 'T':
                  atomicnumber = atomdata[2]
                  x, y, z = atomdata[4:7]
                  ATOMIC_NUMBERS.append(atomicnumber)
                  Xs.append(x)
                  Ys.append(y)
                  Zs.append(z)
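Either form can also be wrapped in a small helper so that the loop body stays short. This is only a sketch: the name parse_atom_row is an assumption for illustration, and the length check is an extra guard that the code above does not have.

```python
def parse_atom_row(line):
    """Return (atomic_number, x, y, z) for a 'T' row, otherwise None."""
    atomdata = line.split()
    # Extra guard (not in the original): ignore lines with too few fields.
    if len(atomdata) < 7 or atomdata[1] != 'T':
        return None
    return atomdata[2], atomdata[4], atomdata[5], atomdata[6]

t_row = '      3 T   6 C     3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02'
f_row = '      4 F   6 C    -3.333333333333E-01  3.333333333333E-01  8.333333333333E-02'

print(parse_atom_row(t_row))
print(parse_atom_row(f_row))
```

With such a helper, the caller appends to ATOMIC_NUMBERS, Xs, Ys and Zs only when the result is not None.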
