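<p>A table of contents (TOC) is usually stored as regular text on a page.</p>
<p>Try <a href="https://pypi.org/project/pdfreader/" rel="nofollow noreferrer">pdfreader</a> to extract the text and/or the PDF "markup".</p>
<p>Here is sample code that extracts both from a page:</p>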
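<pre><code>from pdfreader import SimplePDFViewer

# your_pdf_file_name and toc_page_number are placeholders: supply your own values
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    # navigate to the page that holds the TOC
    viewer.navigate(toc_page_number)
    viewer.render()
    # page text together with the PDF markup commands
    pdf_markdown = viewer.canvas.text_content
    # decoded text strings of the page, joined into a single string
    plain_text = "".join(viewer.canvas.strings)
</code></pre>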
<p>You can then parse <code>plain_text</code> or <code>pdf_markdown</code> like any regular string.</p>
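<p>As a rough illustration of that parsing step, here is a minimal sketch. It assumes each TOC entry sits on its own line and looks like a title, then dot leaders or at least two spaces, then a page number. That layout, the <code>TOC_LINE</code> pattern and the <code>parse_toc</code> helper are hypothetical and not part of pdfreader. Real PDFs often split text into fragments, so you may need to join <code>viewer.canvas.strings</code> differently (for example with newlines) and adjust the pattern to your document.</p>

```python
import re

# Hypothetical entry layout: "<title> ..... <page>" or "<title>   <page>".
# Title and page number must be separated by at least two dots/spaces.
TOC_LINE = re.compile(r"^\s*(?P<title>.*?\S)[\s.]{2,}(?P<page>\d+)\s*$")


def parse_toc(text):
    """Return (title, page) tuples for every line that looks like a TOC entry."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:  # lines without a trailing page number are skipped
            entries.append((match.group("title"), int(match.group("page"))))
    return entries


# Sample text standing in for what was extracted from the TOC page
sample = "Contents\n1. Introduction ..... 3\nChapter 2 Methods   17\n"
toc = parse_toc(sample)
```

<p>Lines that do not end in a page number, such as the "Contents" heading in the sample, are simply ignored.</p>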