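<p>If your PDF is text-based rather than a scanned document (i.e. if you can click and drag to select text in the table in a PDF viewer), you can use the module <a href="https://camelot-py.readthedocs.io/en/master/" rel="nofollow noreferrer"><code>camelot-py</code></a>:</p>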
<pre class="lang-py prettyprint-override"><code>import camelot
tables = camelot.read_pdf('foo.pdf')
</code></pre>
<p>You then have a choice of how to save the tables (as csv, json, excel, html or sqlite), and whether the output should be compressed into a ZIP archive:</p>
<pre class="lang-py prettyprint-override"><code>tables.export('foo.csv', f='csv', compress=False)
</code></pre>
<hr/>
<p>Edit: <a href="https://tabula-py.readthedocs.io/en/latest/" rel="nofollow noreferrer"><code>tabula-py</code></a> appears to be roughly 6 times faster than <code>camelot-py</code>, so it should be used instead:</p>
<pre class="lang-py prettyprint-override"><code>import camelot
import cProfile
import pstats
import tabula

# Profile tabula-py reading page 1 of the PDF in lattice mode
cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
prof_tabula = cProfile.Profile().run(cmd_tabula)
time_tabula = pstats.Stats(prof_tabula).total_tt

# Profile camelot-py on the same page with the equivalent lattice flavor
cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
prof_camelot = cProfile.Profile().run(cmd_camelot)
time_camelot = pstats.Stats(prof_camelot).total_tt

# Total time for each library in seconds, then the camelot/tabula ratio
print(time_tabula, time_camelot, time_camelot/time_tabula)
</code></pre>
<p>which gives (tabula-py time in seconds, camelot-py time in seconds, and their ratio):</p>
<pre class="lang-py prettyprint-override"><code>1.8495559890000015 11.057014036000016 5.978199147125147
</code></pre>
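<p>The figures above come from a single profiled run of each call, and <code>cProfile</code> itself adds some instrumentation overhead. As a rough cross-check, here is a minimal sketch that times plain callables with <code>time.perf_counter</code> and keeps the best of several runs. The helper names <code>time_call</code> and <code>compare_timings</code> are made up for this illustration and are not part of either library; the demo at the bottom uses stand-in workloads rather than real PDF reads.</p>

```python
import time
from typing import Callable, Dict


def time_call(func: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` calls of func.

    Hypothetical helper for illustration; not part of camelot-py or tabula-py.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def compare_timings(
    candidates: Dict[str, Callable[[], object]], repeats: int = 3
) -> Dict[str, float]:
    """Time every named callable and return a name -> best-seconds mapping."""
    return {name: time_call(func, repeats) for name, func in candidates.items()}


if __name__ == "__main__":
    # Stand-in workloads; swap in the real PDF-reading calls to compare libraries.
    timings = compare_timings(
        {
            "small": lambda: sum(range(1_000)),
            "large": lambda: sum(range(1_000_000)),
        }
    )
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f} s")
```

<p>To compare the two libraries this way, pass <code>lambda: tabula.read_pdf('table.pdf', pages='1', lattice=True)</code> and <code>lambda: camelot.read_pdf('table.pdf', pages='1', flavor='lattice')</code> as the candidates. Taking the best of several runs reduces the influence of one-off startup costs, so the ratio may differ from the single-run figure above.</p>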