Extracting lines from a huge text file in Python using start and end markers

ATP(1):C39(3) - A:TYR(58):CD2(67)
ATP(1):C39(3) - A:TYR(58):CE2(69)
ATP(1):C59(6) - A:ILE(61):CD1(100)
ATP(1):C59(6) - A:LYS(87):CE(344)

Hydrogen bonds:
    Location of Donor | Sidechain/Backbone | Secondary Structure | Count
    -------------------|--------------------|---------------------|-------
    LIGAND | SIDECHAIN | OTHER | 1
    RECEPTOR | BACKBONE | BETA | 1

Raw data:
    ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)

Hydrophobic contacts (C-C):
    Sidechain/Backbone | Secondary Structure | Count
    --------------------|---------------------|-------
    SIDECHAIN | OTHER | 2
    SIDECHAIN | BETA | 23

Raw data:
    ATP(1):C39(3) - A:TYR(58):CD2(67)
    ATP(1):C39(3) - A:TYR(58):CE2(69)
    ATP(1):C59(6) - A:ILE(61):CD1(100)
    ATP(1):C59(6) - A:LYS(87):CE(344)
    ATP(1):C4(23) - A:PHE(209):CD1(1562)
    ATP(1):C4(23) - A:PHE(209):CE1(1564)
    ATP(1):C2(26) - A:PHE(209):CD2(1563)
    ATP(1):C6(28) - A:PHE(209):CB(1560)
    ATP(1):C6(28) - A:PHE(209):CG(1561)
    ATP(1):C6(28) - A:PHE(209):CD1(1562)
    ATP(1):C6(28) - A:VAL(286):CG2(2266)

pi-pi stacking interactions:
    ATP(1):C8(30) - A:LYS(87):CG(342)
    ATP(1):C8(30) - A:GLU(159):CD(1066)
    ATP(1):C8(30) - A:PHE(209):CE1(1564)

from itertools import islice


def start_end_points(file_name):
    f = open(file_name)
    lines = f.readlines()
    for s, line in enumerate(lines):
        if "Hydrogen bonds:" in line:
            print(s)
    for e, line in enumerate(lines):
        if "pi-pi stacking interactions:" in line:
            print(e)
    print(islice(lines, s, e))


start_end_points("foo.txt")

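3 Answers

User

Answer #1 · edited 2024-09-28 20:51:33

I think this is more efficient: because you can iterate over f directly, you can skip the list conversion lines = f.readlines(). This code also passes through the data only once (using two while loops), whereas your code uses two for loops that each run to the end of the file.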

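from pprint import pprint


def start_end_points(file_name):
    result = []

    with open(file_name) as f:
        # Skip ahead to the start marker. next(f, "") returns "" at end of
        # file, so a missing marker ends the loop instead of raising.
        single_line = next(f, "")
        while single_line and "Hydrogen bonds:" not in single_line:
            single_line = next(f, "")

        # Collect lines until the end marker (or the end of the file).
        while single_line and "pi-pi stacking interactions:" not in single_line:
            result.append(single_line.rstrip())
            single_line = next(f, "")

    pprint(result)


start_end_points("foo.txt")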

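Important note: the file can still be modified after it has been opened, so the lines read in the while loops may not be the ones that were there when f was opened.

Output, by the way: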

['Hydrogen bonds:',
 '    Location of Donor | Sidechain/Backbone | Secondary Structure | Count',
 '    -------------------|--------------------|---------------------|-------',
 '          LIGAND      |      SIDECHAIN     |        OTHER        |   1',
 '',
 '         RECEPTOR     |      BACKBONE      |         BETA        |   1',
 '',
 'Raw data:',
 '     ATP(1):O2A(9) - A:ILE(61):HN(93) - A:ILE(61):N(92)',
 '',
 'Hydrophobic contacts (C-C):',
 '    Sidechain/Backbone | Secondary Structure | Count',
 '    --------------------|---------------------|-------',
 '         SIDECHAIN     |        OTHER        |   2',
 '         SIDECHAIN     |         BETA        |   23',
 '',
 'Raw data:',
 '     ATP(1):C39(3) - A:TYR(58):CD2(67)',
 '     ATP(1):C39(3) - A:TYR(58):CE2(69)',
 '     ATP(1):C59(6) - A:ILE(61):CD1(100)',
 '     ATP(1):C59(6) - A:LYS(87):CE(344)',
 '     ATP(1):C4(23) - A:PHE(209):CD1(1562)',
 '     ATP(1):C4(23) - A:PHE(209):CE1(1564)',
 '     ATP(1):C2(26) - A:PHE(209):CD2(1563)',
 '     ATP(1):C6(28) - A:PHE(209):CB(1560)',
 '     ATP(1):C6(28) - A:PHE(209):CG(1561)',
 '     ATP(1):C6(28) - A:PHE(209):CD1(1562)',
 '     ATP(1):C6(28) - A:VAL(286):CG2(2266)',
 '']
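
The two while loops above (skip until the start marker, then collect until the end marker) can also be written with itertools. This is only a sketch of the same idea, not part of the original answer: the function name extract_section, its parameters, and the sample list are made up for illustration, and the function accepts any iterable of lines, so an open file object works as well.

```python
from itertools import dropwhile, takewhile

START = "Hydrogen bonds:"
END = "pi-pi stacking interactions:"


def extract_section(lines, start=START, end=END):
    """Return the lines from the start marker up to (not including) the end marker."""
    # dropwhile skips everything before the first line that contains `start`.
    from_start = dropwhile(lambda line: start not in line, lines)
    # takewhile stops at the first line that contains `end`.
    section = takewhile(lambda line: end not in line, from_start)
    return [line.rstrip() for line in section]


# Hypothetical sample input; with a real file, pass the open file object instead.
sample = [
    "ATP(1):C39(3) - A:TYR(58):CD2(67)\n",
    "Hydrogen bonds:\n",
    "Raw data:\n",
    "pi-pi stacking interactions:\n",
    "ATP(1):C8(30) - A:LYS(87):CG(342)\n",
]
print(extract_section(sample))
```

If the start marker never appears, dropwhile consumes the whole input and the function returns an empty list rather than raising.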

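User

Answer #2 · edited 2024-09-28 20:51:33

You don't even have to keep all the lines in memory!

with closes the file automatically, even if an error occurs, which makes it both safe and convenient.

Note the two options below: if efficiency is what matters, choose the first one.

I suggest returning the lines instead of printing them. You may have other uses for them later, and you can still print them without running the whole function again.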

def start_end_points(file_name):
    # Option 1: USE this way - EFFICIENT! Streams the file line by line.
    wanted_lines = []
    with open(file_name, "r") as f:
        found = False
        for line in f:
            if found:
                if "pi-pi stacking interactions:" in line:
                    break
                wanted_lines.append(line)
            elif "Hydrogen bonds:" in line:
                wanted_lines.append(line)
                found = True
    return "".join(wanted_lines)


def start_end_points_in_memory(file_name):
    # Option 2: less memory-efficient (reads the whole file), but compact.
    markers = ("Hydrogen bonds:", "pi-pi stacking interactions:")
    with open(file_name, "r") as f:
        all_lines = f.read().split("\n")
    # Indices of the lines that contain either marker, in file order.
    numbers = [i for i, line in enumerate(all_lines)
               if any(marker in line for marker in markers)]
    return "\n".join(all_lines[numbers[0]:numbers[1]])


data = start_end_points("foo.txt")
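
To illustrate the point about returning instead of printing, here is a minimal sketch of a generator variant, so the caller decides whether to print, collect, or write the lines elsewhere. The name iter_section, its start/end parameters, and the temporary demo file are assumptions made for this example, not part of the answer.

```python
import os
import tempfile


def iter_section(file_name, start, end):
    """Yield lines from the first line containing `start` up to the line before `end`."""
    with open(file_name, "r") as f:
        found = False
        for line in f:
            if not found:
                found = start in line
                if not found:
                    continue  # still before the start marker
            elif end in line:
                return  # end marker reached: stop reading
            yield line


# Demo on a small throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header\nHydrogen bonds:\nRaw data:\npi-pi stacking interactions:\ntail\n")

section = list(iter_section(tmp.name, "Hydrogen bonds:", "pi-pi stacking interactions:"))
os.remove(tmp.name)

print("".join(section), end="")     # print the section once
print(len(section), "lines kept")   # reuse the result without re-reading the file
```

Because the lines are collected into section once, they can be printed, counted, or saved again without going back to the file.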

User

Answer #3 · edited 2024-09-28 20:51:33

There is no reason to load the whole file into memory!

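def start_end_points(file_name):
    with open(file_name) as f:
        found = False
        for line in f:
            if found or ("Hydrogen bonds:" in line):
                found = True
                # Each line already ends with a newline, so suppress print's own.
                print(line, end="")
            # Checked after printing, so the end-marker line itself is printed too.
            if "pi-pi stacking interactions:" in line:
                break


start_end_points("foo.txt")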

This way only a single buffer is kept in memory, each line is processed once, and reading stops as soon as the pi... line is reached.
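
The early stop can be checked with a small experiment. The sketch below assumes a variant of the loop above that takes any iterable of lines (the name print_section and the returned list are additions for illustration) and feeds it an in-memory io.StringIO stream, so whatever follows the end marker is still unread afterwards. With a real file the same loop applies, though Python reads the file in buffered chunks behind the scenes.

```python
import io


def print_section(stream, start="Hydrogen bonds:", end="pi-pi stacking interactions:"):
    """Same loop as above, but on any iterable of lines; returns the lines it printed."""
    printed = []
    found = False
    for line in stream:
        if found or start in line:
            found = True
            print(line, end="")
            printed.append(line)
        if end in line:
            break  # nothing after the end marker is consumed
    return printed


# Hypothetical in-memory input standing in for foo.txt.
stream = io.StringIO(
    "header\n"
    "Hydrogen bonds:\n"
    "Raw data:\n"
    "pi-pi stacking interactions:\n"
    "ATP(1):C8(30) - A:LYS(87):CG(342)\n"
)
printed = print_section(stream)
remaining = stream.read()  # what the loop never touched
```

After the call, remaining holds only the text that follows the end-marker line, which shows that the loop did not consume the rest of the stream.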
