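<p>Is this what you're after? I don't know what "dd" is, so I've assumed that headers and footers are single lines that start in a particular way. You'll need to replace the logic that matches headers and footers to suit your use case.</p>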
<pre><code>import re
VALID_DOCUMENT_A = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2
This is the second page.
footer: 2
"""
VALID_DOCUMENT_B = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2
This is the second page.
footer: 2
header: 3
This third page has a header but no footer.
"""
INVALID_DOCUMENT_A = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
This is the second page. Where's the header, though?
footer: 2
"""
INVALID_DOCUMENT_B = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1a
footer: 1b
This is the second page. Oops - two footers above.
footer: 2
"""
INVALID_DOCUMENT_C = \
"""
header: 1
Hi there.
This is the body of page 1.
footer: 1
header: 2a
header: 2b
This is the second page. Oops - two headers above.
footer: 2
"""
def is_header(line):
    # Placeholder rule: a header is any line starting with "header:".
    return re.match('^header:.*', line)

def is_footer(line):
    # Placeholder rule: a footer is any line starting with "footer:".
    return re.match('^footer:.*', line)

def pair_headers_and_footers(text):
    lines = text.splitlines()
    stack = []  # holds at most one header that is waiting for its footer
    for i, line in enumerate(lines, 1):
        if is_header(line):
            if stack:
                raise ValueError('Got unexpected header on line {}'.format(i))
            stack.append(line)
        elif is_footer(line):
            if not stack:
                raise ValueError('Got unexpected footer on line {}'.format(i))
            yield stack.pop(), line
    # A trailing header with no footer (see VALID_DOCUMENT_B) is silently ignored.

if __name__ == '__main__':
    documents = [
        VALID_DOCUMENT_A,    # 2 headers, 2 footers
        VALID_DOCUMENT_B,    # 3 headers, 2 footers
        INVALID_DOCUMENT_A,  # missing header for page 2
        INVALID_DOCUMENT_B,  # multiple footers for page 1
        INVALID_DOCUMENT_C   # multiple headers for page 2
    ]
    for document in documents:
        try:
            print(list(pair_headers_and_footers(document)))
        except ValueError as e:
            print(e)
</code></pre>
<p><strong>Output</strong></p>
<pre class="lang-none prettyprint-override"><code>[('header: 1', 'footer: 1'), ('header: 2', 'footer: 2')]
[('header: 1', 'footer: 1'), ('header: 2', 'footer: 2')]
Got unexpected footer on line 7
Got unexpected footer on line 6
Got unexpected header on line 7
</code></pre>
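<p>Since the matching rules will differ for your documents, one option is to pass them in as arguments rather than hard-coding them. The following is only a sketch: the name <code>pair_blocks</code>, the two predicates, and the "== Page N ==" / "-- N --" line formats are made up for illustration and are not taken from your data.</p>
<pre><code>import re

def pair_blocks(lines, is_header, is_footer):
    """Yield (header, footer) pairs; the caller supplies both predicates."""
    pending = None  # the header currently waiting for its footer, if any
    for i, line in enumerate(lines, 1):
        if is_header(line):
            if pending is not None:
                raise ValueError('Got unexpected header on line {}'.format(i))
            pending = line
        elif is_footer(line):
            if pending is None:
                raise ValueError('Got unexpected footer on line {}'.format(i))
            yield pending, line
            pending = None

# Hypothetical formats, purely for illustration.
page_header = re.compile(r'== Page \d+ ==$').match
page_footer = re.compile(r'-- \d+ --$').match

sample = ['== Page 1 ==', 'Some text.', '-- 1 --',
          '== Page 2 ==', 'More text.', '-- 2 --']
pairs = list(pair_blocks(sample, page_header, page_footer))
print(pairs)
</code></pre>
<p>Any callable that returns something truthy for a matching line will do, so a plain <code>str.startswith</code> check would work just as well as a compiled regex.</p>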
<p><strong>Addendum</strong></p>
<p>I should add that, in the function <code>pair_headers_and_footers</code>, you could replace the following line:</p>
<pre><code>lines = text.splitlines()
</code></pre>
<p>with:</p>
<pre><code>lines = (m.group(0).rstrip() for m in re.finditer('(.*\n|.+$)', text))
</code></pre>
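<p>This may help reduce memory usage, especially when processing large amounts of text, because the lines are no longer collected into a list up front. With this change, the whole process of pairing headers and footers becomes lazy.</p>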
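<p>To see the laziness in isolation, here is a small sketch that wraps the same regex in a generator function. The name <code>iter_lines</code> and the sample text are my own, chosen for illustration. Note that <code>rstrip()</code> removes trailing spaces as well as the newline, so the result can differ slightly from <code>splitlines()</code> on lines that end in whitespace.</p>
<pre><code>import re

def iter_lines(text):
    """Lazily yield the lines of text one at a time, without building a list."""
    for m in re.finditer(r'.*\n|.+$', text):
        yield m.group(0).rstrip()

text = "header: 1\nbody\nfooter: 1\nheader: 2"
lines = iter_lines(text)
first = next(lines)  # only the first line has been scanned at this point
rest = list(lines)   # consuming the generator scans the remainder
print(first)
print(rest)
</code></pre>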