python3 docx: get the text between two paragraphs

Posted 2024-09-30 10:26:51


I have .docx files in a directory, and I want to get all the text between two paragraphs.

Example:

Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :

I want to get:

The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life. 

I wrote this code:

import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)


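# docxfiles is not defined in the original post; assumed here to be
# the .docx files in the current directory
docxfiles = glob.glob('*.docx')

for f in docxfiles: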
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        
        print(testf)
    except IOError:
        print('Error opening',f)

It returns None.

What am I doing wrong?


1 answer
User
#1 · Posted 2024-09-30 10:26:51

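If you loop over all the paragraphs and print each paragraph's text, you get the document text as it is, but a single p.text inside the loop does not contain the full document text.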

It works with a string:

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""
      
import re
      
fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
      
print(fread)  # None  - because dots do not match \n
     
fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
      
print(fread)
print(fread[1])

Output:

<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
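The captured group above still carries the blank lines around the text, and a greedy `.*` would run up to the last `Bar` in a longer document. Below is a minimal sketch of a variant that uses a non-greedy group and `strip()`. The helper name `text_between` and the default patterns are made up for illustration, and it assumes the end marker looks like `Bar :`.

```python
import re

def text_between(text, start=r'Foo\s*:', end=r'Bar\s*:'):
    """Return the stripped text between the first start/end match, or None."""
    # re.DOTALL lets '.' match newlines; '.*?' stops at the first end marker
    m = re.search(start + r'(.*?)' + end, text, re.DOTALL)
    return m.group(1).strip() if m else None

sample = "Foo :\n\nThe foo is not easy.\n\nBar :\n\nOther text\n\nBar :"
print(text_between(sample))
```

With the sample string this prints only the sentence between the first `Foo :` and the first `Bar :`.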

If you use

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

you will see why your regex does not match a single paragraph. Your regex would work on the whole document text.

See "How to extract text from an existing docx file using python-docx" for how to get the whole document text.

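You could also look for a paragraph that matches r'Foo\s*:' and then put the .text of every following paragraph into a list until you hit a paragraph that matches r'\s*Bar'.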
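The paragraph-by-paragraph approach from the last suggestion could be sketched roughly as follows. The function name `paragraphs_between` is hypothetical, and it takes plain strings so that it can be tried without a .docx file. Skipping empty paragraphs is an added assumption, not something the answer specifies.

```python
import re

def paragraphs_between(paragraph_texts, start=r'Foo\s*:', end=r'\s*Bar'):
    """Collect paragraph texts after the start marker, up to the end marker."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Keep skipping until a paragraph matches the start marker
            if re.match(start, text):
                inside = True
            continue
        if re.match(end, text):
            break  # end marker reached, stop collecting
        if text.strip():  # assumption: empty paragraphs are not wanted
            collected.append(text)
    return collected

texts = ["Intro", "Foo :", "", "The foo is not easy.", "We are looking.", "Bar :", "After"]
print(paragraphs_between(texts))
```

With python-docx, the input would be something like `(p.text for p in docx.Document(path).paragraphs)`. If no end marker is ever found, this sketch returns everything after the start marker, which may or may not be what you want.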
