How to use Python to extract the part of a text file that follows the second occurrence of a specific word

for file in tqdm(files):
    with open(file, encoding='ISO-8859-1') as f:
        for line in f:
            if line.strip() == 'Item 1A.Risk Factors':
                break
        for line in f:
            if line.strip() == 'Item 1B':
                break
    f=open(os.path.join('QTR4_Risk_Factors', os.path.basename(file)) , 'w')
    f.write(line)
    f.close()

3 answers

User

Answer #1 · edited 2024-05-02 07:00:55

You could try a regular expression:

import re

t = """Item 1a.Risk Factors

not any text (unwanted portion)
Item 1b

End of table of contents

Main content
Item 1a. Risk Factors

text (wanted portion)
text (wanted portion)
text (wanted portion)
Item 1b"""

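# Escape the dots and allow optional whitespace after "1a." so both heading spellings match.
crit = re.compile(r'Item 1a\.\s*Risk Factors.*?Item 1a\.\s*Risk Factors(.*?)Item 1b', re.I|re.DOTALL)
match = crit.search(t)  # search once and reuse the match object
if match:
    result = match.group(1)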
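
If you want to reuse this on many files, the same idea can be wrapped in a small function. This is only a sketch: `extract_risk_factors` and `SECTION_RE` are names I made up, and the heading spellings are assumed from the sample text above. Note that it needs the whole document in memory as a single string.

```python
import re

# Hypothetical pattern: heading spellings are assumed from the sample text.
SECTION_RE = re.compile(
    r'Item 1a\.\s*Risk Factors.*?Item 1a\.\s*Risk Factors(.*?)Item 1b',
    re.I | re.DOTALL,
)


def extract_risk_factors(text):
    """Return the text between the second 'Item 1A' heading and the
    next 'Item 1B', or None when the pattern is not found."""
    match = SECTION_RE.search(text)
    return match.group(1).strip() if match else None


sample = (
    "Item 1A.Risk Factors\n"
    "table of contents entry\n"
    "Item 1B\n"
    "Item 1A. Risk Factors\n"
    "wanted line 1\n"
    "wanted line 2\n"
    "Item 1B\n"
)
print(extract_risk_factors(sample))
```

Returning None instead of raising lets the calling loop skip files that do not contain the section.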

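User

Answer #2 · edited 2024-05-02 07:00:55

Your code is almost right. One problem is that while you scan the document for the "end text", you never save the part you actually want. Where possible, it is better to keep as little text in memory as you can, because we don't know how large the documents you are analyzing are. To do that, we can write the new file while reading the original one.

Ronie's answer is going in the right direction, but it doesn't address the fact that you want to start saving the text only after the second occurrence of your "start hint". Unfortunately, I can't comment to suggest an edit yet, so I'm adding it as a new answer. Try this: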

import os
from tqdm import tqdm

for file in tqdm(files):
    out_path = os.path.join('QTR4_Risk_Factors', os.path.basename(file))
    with open(file, encoding='ISO-8859-1') as f, open(out_path, 'w') as w:
        start_hint_counter = 0
        write = False
        for line in f:
            if not write:
                # Count the start hints; copying begins after the second one.
                if line.strip() == 'Item 1A.Risk Factors':
                    start_hint_counter += 1
                    if start_hint_counter == 2:
                        write = True
                continue  # never write the heading or anything before it
            if line.strip() == 'Item 1B':
                break  # end hint reached
            w.write(line)
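
The same streaming idea can be pulled into a generator so it is easy to try on in-memory text first. This is a rough sketch, not a drop-in replacement: `normalize` and `lines_after_nth` are hypothetical names, and I am assuming you want the headings compared without regard to case or whitespace (so 'Item 1A.Risk Factors' and 'Item 1a. Risk Factors' count as the same heading). The end hint still has to be the whole line, as in the code above.

```python
import io
import re


def normalize(line):
    """Lower-case a line and drop all whitespace so heading variants compare equal."""
    return re.sub(r'\s+', '', line).lower()


def lines_after_nth(lines, start, end, n=2):
    """Yield the lines that follow the n-th `start` heading,
    stopping at the next `end` heading (headings are not yielded)."""
    start_key, end_key = normalize(start), normalize(end)
    seen = 0
    for line in lines:
        key = normalize(line)
        if seen < n:
            if key == start_key:
                seen += 1
            continue  # skip headings and everything before the n-th one
        if key == end_key:
            return
        yield line


sample = io.StringIO(
    "Item 1A.Risk Factors\n"
    "toc entry\n"
    "Item 1B\n"
    "Item 1a. Risk Factors\n"
    "wanted 1\n"
    "wanted 2\n"
    "Item 1B\n"
    "after\n"
)
result = list(lines_after_nth(sample, 'Item 1A.Risk Factors', 'Item 1B'))
print(result)
```

With real files you would pass the open input file as `lines` and hand the generator to `w.writelines(...)`, so only one line is held in memory at a time.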

User

Answer #3 · edited 2024-05-02 07:00:55

I think you should use a flag to know when to copy the lines. You can also open two or more files at once in a single context manager:

with open(file, encoding='ISO-8859-1') as f, open(os.path.join('QTR4_Risk_Factors', os.path.basename(file)) , 'w') as w:
    write = False
    for line in f:
        if line.strip() == 'Item 1A.Risk Factors': 
            write = True
            continue
        elif line.strip() == 'Item 1B':
            write = False
        if write:
            w.write(line)


Edit: added continue
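
To see what the flag approach copies, here is the same loop run against in-memory text. It is only an illustration: `copy_flagged` is a made-up wrapper and the sample lines are invented. The flag turns on at every start heading, so a document that repeats the heading (for example in a table of contents) has both sections copied.

```python
import io


def copy_flagged(lines, start, end):
    """Collect every line that sits between a `start` heading and the next `end` heading."""
    kept = []
    write = False
    for line in lines:
        if line.strip() == start:
            write = True
            continue  # do not copy the heading itself
        elif line.strip() == end:
            write = False
        if write:
            kept.append(line)
    return kept


sample = io.StringIO(
    "Item 1A.Risk Factors\n"
    "toc entry\n"
    "Item 1B\n"
    "Item 1A.Risk Factors\n"
    "wanted\n"
    "Item 1B\n"
)
copied = copy_flagged(sample, 'Item 1A.Risk Factors', 'Item 1B')
print(copied)
```

If only the section after the second heading is needed, the flag has to be combined with a counter of how many start headings have been seen.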
