How to split a text file based on its content

1. Ann Intern Med. 2013 Dec 3;159(11):721-8. doi:10.7326/0003-4819-159-11-201312030-00004. text text text texttext texttext texttext texttext texttext texttext texttext text text texttext texttext texttext texttext texttext text text texttext texttext texttext texttext text PMID: 24297188 [PubMed - indexed for MEDLINE]

2. Am J Cardiol. 2013 Sep 1;112(5):688-93. doi: 10.1016/j.amjcard.2013.04.048. Epub 2013 May 24. text texttext texttext texttext texttext texttext texttext texttext texttext text text texttext texttext texttext texttext texttext texttext texttext texttext text PMID: 23711805 [PubMed - indexed for MEDLINE]

3. Am J Cardiol. 2013 Aug 15;112(4):513-9. doi: 10.1016/j.amjcard.2013.04.015. Epub 2013 May 11. text texttext texttext texttext texttext texttext texttext texttext texttext text text texttext texttext texttext texttext texttext texttext texttext texttext text PMID: 23672989 [PubMed - indexed for MEDLINE]

2 answers

User

#1 · edited 2024-09-28 23:43:47

There are many ways to do this. Here is one. If the data is in a file named data:

import re

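def open_chunk(readfunc, delimiter, chunksize=1024):
    """
    http://stackoverflow.com/a/17508761/190597
    readfunc(chunksize) should return a string.
    Yields the pieces of the input split on the regex `delimiter`,
    reading at most `chunksize` characters at a time.
    """
    remainder = ''
    # iter() keeps calling readfunc until it returns '' (end of file)
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        # every piece except the last one is complete
        for piece in pieces[:-1]:
            yield piece
        # the last piece may continue in the next chunk
        remainder = pieces[-1]
    if remainder:
        yield remainder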

with open('data', 'r') as infile:
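    # the capturing group makes re.split keep each PMID line as its own piece
    chunks = open_chunk(infile.read, delimiter=r'(PMID.*)')
    # zip(*[chunks]*2) pairs each text piece with the PMID line that follows it
    for i, (chunk, delim) in enumerate(zip(*[chunks]*2)):
        chunk = chunk+delim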
        chunk = chunk.strip()
        if chunk:
            print(chunk)
            print('-'*80)
            # uncomment this if you want to save the chunk to a file named dataXXX
            # with open('data{:03d}'.format(i), 'w') as outfile:
            #     outfile.write(chunk)

This prints each record (the text up to and including its PMID line), followed by a line of 80 hyphens.

Uncomment the last two lines to save each chunk to a separate file.

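Why so complicated?

For short files, you can simply read the whole file into one string and split that string with a regular expression. The solution above adapts this idea so that it also handles large files: it reads the file in chunks, finds where each chunk should be split, and yields the pieces as they are found.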
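For comparison, here is a minimal sketch of the short-file approach described above: read everything into one string and split it with the same `(PMID.*)` pattern. The helper name `split_records` and the sample text are made up for illustration and are not part of the original answer.

```python
import re

def split_records(text, delimiter=r'(PMID.*)'):
    """Split text into records, each ending with its delimiter line."""
    # With one capturing group, re.split returns
    # [text, delim, text, delim, ..., tail]
    pieces = re.split(delimiter, text)
    records = []
    for body, delim in zip(pieces[0::2], pieces[1::2]):
        record = (body + delim).strip()
        if record:
            records.append(record)
    return records

# Hypothetical sample data, standing in for the contents of a small file
sample = (
    "1. Journal A. 2013.\nabstract text\nPMID: 111 [PubMed]\n\n"
    "2. Journal B. 2013.\nabstract text\nPMID: 222 [PubMed]\n"
)
for record in split_records(sample):
    print(record)
    print('-' * 80)
```

This keeps the whole file in memory, so it is only a reasonable choice when the file is small.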

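The problem of processing a file in pieces separated by a delimiter regex comes up often. So rather than writing a custom solution each time, it is easier to use a utility function like open_chunk that handles all such cases, whatever the delimiter, and works for both large and small files.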
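To illustrate the reuse, the sketch below runs the same open_chunk function on a different delimiter. The in-memory stream, the `---` separator line and the deliberately tiny chunksize are assumptions chosen only to show that a delimiter straddling a chunk boundary is still found.

```python
import io
import re

def open_chunk(readfunc, delimiter, chunksize=1024):
    """Same function as in the answer above, repeated so this runs on its own."""
    remainder = ''
    for chunk in iter(lambda: readfunc(chunksize), ''):
        pieces = re.split(delimiter, remainder + chunk)
        for piece in pieces[:-1]:
            yield piece
        remainder = pieces[-1]
    if remainder:
        yield remainder

# Hypothetical data: three records separated by a line containing '---'
stream = io.StringIO("alpha\n---\nbeta\n---\ngamma\n")

# chunksize=4 forces the separator to be cut across several reads
records = [piece.strip()
           for piece in open_chunk(stream.read, r'\n---\n', chunksize=4)]
print(records)
```

A fixed separator like this one is safe because a partial separator at the end of a chunk does not match and stays in the remainder. An open-ended pattern such as `PMID.*` behaves differently: if a chunk happens to end in the middle of a PMID line, the pattern matches the truncated line, so the chunksize may need attention for such patterns.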

User

#2 · edited 2024-09-28 23:43:47

You can try:

with open("txtfile.txt", "r") as f:  # read the whole file
    ss = f.read()

bb = ss.split("\nPMID:")  # split into blocks

# Reinsert the `PMID:` if needed (the first block never lost it):
bb1 = bb[:1] + ["PMID:" + b for b in bb[1:]]

Note that the final newline of each block is removed. The blocks can then be written to separate files.
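As a rough sketch of that last step, the blocks could be written out like this. The helper names, the `block000.txt` file naming and the temporary output directory are assumptions made for the example; the split rebuilds the list from the second block onward so that the first block is not duplicated.

```python
import os
import tempfile

def split_on_pmid(text):
    """Split on '\\nPMID:' and put the 'PMID:' prefix back on later blocks."""
    blocks = text.split("\nPMID:")
    return blocks[:1] + ["PMID:" + b for b in blocks[1:]]

def write_blocks(blocks, directory, prefix="block"):
    """Write each block to its own numbered file and return the paths."""
    paths = []
    for i, block in enumerate(blocks):
        path = os.path.join(directory, "{}{:03d}.txt".format(prefix, i))
        with open(path, "w") as outfile:
            # restore the newline that split() removed from the block's end
            outfile.write(block.rstrip("\n") + "\n")
        paths.append(path)
    return paths

# Hypothetical sample text and a throwaway output directory
sample = "first entry\nPMID: 111\nsecond entry\nPMID: 222\n"
out_dir = tempfile.mkdtemp()
paths = write_blocks(split_on_pmid(sample), out_dir)
print(paths)
```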
