How to read a specific chunk of a huge file

Posted 2024-10-03 13:19:54


I have a large file of about 2 GB that contains data like this:

>TRINITY_DN19211_c0_g1_i1 len=332 path=[619:0-331] [-1, 619, -2]
GTCCAAGTATTACACACCGTATGATGAAGCTAACGGTGAATTTTCAAAATGTGTGAAGTT
TGAGAATGGGTTGCGCCCTGAGATCAAACAGGCGATTGGATACCAGAGGATTCGAAGGTT
TTCGGAGTTGGTAGACTGCTGCAGGATCTTTGAAGAGGATTCCAGAGCAAGGTCAACTCA
>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
>TRINITY_DN35855_c0_g1_i1 len=782 path=[760:0-781] [-1, 760, -2]
CAGGTTTAACTTTAACACCTCCGACCCTGCCTCTAAATTCCTGCACAGAAATTTGGCTTC
ACAATTAGGACATGTTTGGATAAACAGTTTAATGAAGCACTTTTTTTCATAAATTCTGGT
ATCTGGCTATAAGACCTAATAATCTGGGGATCTGTTTCATCATCCACGAAGGGAGCCCAA
>TRINITY_DN67801_c0_g1_i1 len=420 path=[398:0-419] [-1, 398, -2]
GTACAGAAGGAGATGAACCAGAACTTTGCCTATCTCTACAATCATCTCCTTATCCCTCCT
TATGACCCAGAGAATCCGGCTGCTCCTATTCCTCCCGTTGTGTCACTACAAATTATGCCT
>TRINITY_DN52435_c0_g1_i1 len=209 path=[187:0-208] [-1, 187, -2]
TGGTCAAACTTGTATGAGTTCTAAACTCCTTGGGTTTTCTGCTAAGCGAAAGCCGCTTGT
ACTTTAGCTTCTGTTTAGTTAGATAGCACCACCTCATAAGCGCAGTTCTGTTTTGAGGTT

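I want to write code that returns a chunk starting at line 5 and ending when the next line containing the character ">" is reached. The output should look like the block below. I need to extract many such chunks: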

 >TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
    ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
    TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
    GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
    TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA

What is the best way to do this? Thanks in advance.


3 answers
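start_ln = 4  # 0-based index of the first line of the chunk (line 5 of the file)
chunk = []
with open("data.txt", buffering=2**12) as f:  # the keyword is `buffering`; it sets the I/O buffer size in bytes
    for i, ln in enumerate(f):
        if i < start_ln:
            continue  # skip everything before the chunk
        if i > start_ln and ln.startswith(">"):
            break  # the next header line ends the chunk
        chunk.append(ln)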

It is not clear when you want the chunk to end: when a ">" appears at the start of a line, or anywhere in a line. I will assume the first case:

chunk = []
with open("your_large_file.ext", "r") as f:
    for _ in range(4):  # skip the first 4 lines (use xrange() on Python 2)
        next(f, None)  # the default avoids StopIteration on a file shorter than 4 lines
    for line in f:
        if chunk and line.startswith(">"):  # break on > if we're already collecting a chunk
            break
        chunk.append(line)
print("".join(chunk))  # or whatever you want to do with it

This yields:

>TRINITY_DN63782_c0_g1_i1 len=433 path=[411:0-432] [-1, 411, -2]
ATAGACACGAACACAAACACATAAATAATTTGAGAAAATAGAAGTGATTGAACTTGTTGG
TGTGGTACAGGTGTCAAACAAACCTTCAACCAGAAGTTTTGTTGCTGCATAAATCATAGT
GACACTCTGATATGATATCAAAGAAAATCATGTAACCCAAATACATCCCTAAGTATCTAG
TTGAAGCTACAGTCCACTAATTGTAACAATATTAAGTAATTATGAAATGAACCATTTGCA
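
The skip-then-collect idea above can also be wrapped in a reusable function. The following is only a minimal sketch, not part of the original answer: the name `chunk_from_lines` is hypothetical, and it assumes the chunk starts at a known 0-based line index and ends at the next line that begins with ">". It takes any iterable of lines, so an open file object works.

```python
from itertools import islice

def chunk_from_lines(lines, start_line):
    """Return the chunk that starts at 0-based index `start_line`.

    `lines` is any iterable of text lines, for example an open file.
    Hypothetical helper, written for illustration only.
    """
    chunk = []
    # islice consumes and discards the first `start_line` lines lazily,
    # so the file is never loaded into memory as a whole.
    for line in islice(lines, start_line, None):
        # Stop at the next header, but only once something was collected,
        # so the header at `start_line` itself is kept.
        if chunk and line.startswith(">"):
            break
        chunk.append(line)
    return "".join(chunk)
```

With a real file this would be called as `with open("your_large_file.ext") as f: print(chunk_from_lines(f, 4))`. Only the lines up to the end of the requested chunk are read.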

If you know which line your data starts on, you can use this function:

def extract_chunk(start_line):
    """
    start_line is the line number where your data starts, counting from 0
    """
    lines = []
    with open("data.txt") as f:
        for i, line in enumerate(f):
            if i < start_line:
                continue  # ignore headers and data before the requested line
            if i > start_line and line.startswith(">"):
                break  # the next header ends the chunk
            lines.append(line)
    return "".join(lines)
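
Since the question mentions extracting many such chunks, rescanning a 2 GB file once per chunk would be slow. One possible alternative is a single pass that yields only the wanted records. The sketch below rests on assumptions: the function names `iter_records` and `select_records` are hypothetical, every record is taken to start with a line beginning with ">", and the record ID is taken to be the first whitespace-delimited token after the ">".

```python
def iter_records(lines):
    """Yield (header, body_lines) for each '>'-delimited record in `lines`."""
    header, body = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, body  # emit the record that just ended
            header, body = line, []
        elif header is not None:
            body.append(line)  # sequence line belonging to the current record
    if header is not None:
        yield header, body  # emit the last record in the input

def select_records(lines, wanted_ids):
    """Yield the full text of every record whose ID is in `wanted_ids`."""
    wanted = set(wanted_ids)  # set membership tests are O(1)
    for header, body in iter_records(lines):
        parts = header[1:].split()
        record_id = parts[0] if parts else ""  # assumed ID: first token after '>'
        if record_id in wanted:
            yield header + "".join(body)
```

For example, `with open("data.txt") as f: for rec in select_records(f, ["TRINITY_DN63782_c0_g1_i1"]): print(rec, end="")` would print the chunk shown in the question while holding only one record in memory at a time.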
