Python 3 python-docx: getting the text between two paragraphs

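import docx
import pathlib
import glob
import re

# Assumption: the original post does not show how docxfiles is built;
# here it is every .docx file in the current directory.
docxfiles = glob.glob('*.docx')

def rf(f1):
    # Read the document and join all paragraph texts into one string
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)

for f in docxfiles:
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        print(testf)
    except IOError:
        print('Error opening', f)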

1 answer

User

#1 · Posted 2024-09-30 10:26:51

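If you loop over all paragraphs and print each paragraph's text, the document text comes out unchanged, but any single p.text inside the loop does not contain the full document text.

The regular expression does work on a plain string: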

import re

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""

fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
print(fread)  # None, because "." does not match "\n" without re.DOTALL

fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
print(fread)
print(fread[1])

Output:

None
<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
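
One thing worth checking with this pattern is that `(.*)` is greedy, so with re.DOTALL it runs to the last "Bar" in the text. The sketch below uses made-up sample text with two Foo/Bar sections (an assumption, not taken from the question) to compare the original pattern with a non-greedy `(.*?)` variant, and strips the surrounding blank lines from each capture.

```python
import re

# Hypothetical sample text with two Foo/Bar sections
text = "Foo :\n\nfirst section\n\nBar :\nFoo :\n\nsecond section\n\nBar :"

# Greedy: one match that runs from the first "Foo :" to the last "Bar"
greedy = re.findall(r'Foo\s*:(.*)\s*Bar', text, re.DOTALL)

# Non-greedy: each match stops at the nearest following "Bar"
lazy = re.findall(r'Foo\s*:(.*?)\s*Bar', text, re.DOTALL)

# Remove the leading and trailing blank lines from each capture
sections = [s.strip() for s in lazy]

print(len(greedy))
print(sections)
```

Whether the greedy or the non-greedy form is the right one depends on whether a document can contain more than one Foo/Bar pair.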

If you run

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

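you will see why your regular expression does not match: each p.text holds a single paragraph, while the expression needs the whole document text in order to span from Foo to Bar.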

See "How to extract text from an existing docx file using python-docx" for how to get the whole document text.
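
As a rough illustration of the difference, the sketch below applies the same pattern to individual paragraph strings and to the joined text. The list `paragraph_texts` is made-up sample data standing in for `[p.text for p in docx.Document(path).paragraphs]`, and `join_paragraphs` is a hypothetical helper that mirrors the question's rf() function, so the snippet runs without a .docx file.

```python
import re

PATTERN = r'Foo\s*:(.*)\s*Bar'

def join_paragraphs(paragraph_texts):
    """Join paragraph strings with newlines, as rf() does in the question."""
    return '\n'.join(paragraph_texts)

# Hypothetical stand-in for the p.text values of a real document
paragraph_texts = ["Intro", "Foo :", "", "The foo is not easy.", "", "Bar :", "Outro"]

# Per paragraph: no single paragraph contains both "Foo :" and "Bar"
per_paragraph = [re.search(PATTERN, p, re.DOTALL) for p in paragraph_texts]

# Whole document: the joined text contains both markers
whole = re.search(PATTERN, join_paragraphs(paragraph_texts), re.DOTALL)

print(all(m is None for m in per_paragraph))
print(whole[1].strip())
```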

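You can also look for the paragraph that matches r'Foo\s*:', then collect the .text of every following paragraph into a list until you reach a paragraph that matches r'\s*Bar'.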
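
A minimal sketch of that paragraph-by-paragraph approach follows. The function name `paragraphs_between` and the sample list are assumptions made for illustration; with python-docx the input would be `[p.text for p in reader.paragraphs]`. The sketch assumes each marker sits at the start of its own paragraph, and it returns an empty list when either marker is missing.

```python
import re

def paragraphs_between(paragraph_texts, start_pattern=r'Foo\s*:', end_pattern=r'\s*Bar'):
    """Return the paragraph texts found between the start and end markers."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still looking for the paragraph that opens the section
            if re.match(start_pattern, text):
                inside = True
            continue
        if re.match(end_pattern, text):
            # Closing marker reached: the section is complete
            return collected
        collected.append(text)
    # Start marker never found, or the section was never closed
    return []

# Hypothetical stand-in for the p.text values of a real document
sample = ["Intro", "Foo :", "", "The foo is not easy.", "", "Bar :", "Outro"]

section = paragraphs_between(sample)
non_empty = [t for t in section if t.strip()]
print(non_empty)
```

Empty paragraphs are kept in the returned list and filtered afterwards, so the caller can decide whether blank lines matter.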
