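<p>Adapted from <a href="https://stackoverflow.com/a/59966201/10393104">this answer</a> to <a href="https://stackoverflow.com/questions/55704218/how-to-check-if-pdf-is-scanned-image-or-contains-text">How to check if PDF is scanned image or contains text</a>.</p>
<p>With this solution you don't have to render the PDF, so I would guess it is faster. Basically, the answer I modified uses the percentage of the page area covered by text to decide whether a document is a text document or a scanned document (image).</p>
<p>I added similar reasoning for images: I sum the area covered by images to get the percentage of the page that images cover. If the page is mostly covered by images, you can assume it is a scanned document. You can move the thresholds to suit your document collection.</p>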
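<p>To make the coverage reasoning concrete, here is a minimal sketch of the arithmetic on plain bounding boxes, without PyMuPDF. The helper names (<code>bbox_area</code>, <code>coverage</code>, <code>classify</code>) and the sample page size are my own assumptions for illustration; the 0.01 and 0.8 defaults are the thresholds used in this answer.</p>

```python
def bbox_area(bbox):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes count as zero."""
    x0, y0, x1, y1 = bbox
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def coverage(bboxes, page_area):
    """Summed box area divided by the page area.

    Overlapping boxes are counted twice, so the result can exceed 1.0.
    """
    if page_area <= 0:
        return 0.0
    return sum(bbox_area(b) for b in bboxes) / page_area


def classify(text_cov, img_cov, text_min=0.01, img_min=0.8):
    """Apply the two thresholds in the same order as the answer's code."""
    if text_cov < text_min:   # practically no text
        return "Scanned"
    if img_cov > img_min:     # text plus a near full-page image
        return "Searchable text"
    return "Digitally created"


# Hypothetical A4 page (595 x 842 points) with one full-page image and no text.
page_area = 595 * 842
image_boxes = [(0, 0, 595, 842)]
print(classify(coverage([], page_area), coverage(image_boxes, page_area)))
```

<p>Keeping the thresholds as parameters makes it easy to tune them against a sample of your own documents.</p>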
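<p>I also added logic to check page by page. This is because, at least in the collection I have, some documents have a digitally created first page followed by scanned pages.</p>
<p>Modified code:</p>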
<pre><code>import fitz  # pip install PyMuPDF


def page_type(page):
    """Classify a page by the share of its area covered by text and by images."""
    page_area = abs(page.rect)  # total page area

    # Older PyMuPDF releases spell these methods getText / getTextBlocks.
    img_area = 0.0
    for block in page.get_text("dict")["blocks"]:
        if block["type"] == 1:  # type 1 = image block
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    text_area = 0.0
    for b in page.get_text("blocks"):
        if b[6] != 0:  # skip anything that is not a text block
            continue
        r = fitz.Rect(b[:4])  # rectangle where the block's text appears
        text_area += abs(r)
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # no text = scanned
        label = "Scanned"
    elif img_perc > 0.8:  # has text but very large images = searchable scan
        label = "Searchable text"
    else:
        label = "Digitally created"
    return label


pdffilepath = "example.pdf"  # placeholder: path to your PDF
doc = fitz.open(pdffilepath)
for page in doc:  # iterate through pages to find different types
    print(page_type(page))
</code></pre>
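<p>Because the check runs per page, a document can come back with mixed labels. The following is a rough sketch of one way to turn the per-page labels into a document-level summary; <code>summarize</code> and <code>pages_needing_ocr</code> are hypothetical helpers of mine, not part of PyMuPDF, and they only assume the three label strings returned by the code above.</p>

```python
from collections import Counter


def summarize(labels):
    """Return the shared label, "Mixed" if pages differ, or "Empty" for no pages."""
    counts = Counter(labels)
    if not counts:
        return "Empty"
    if len(counts) == 1:
        return next(iter(counts))
    return "Mixed"


def pages_needing_ocr(labels):
    """1-based page numbers whose label is "Scanned"."""
    return [n for n, label in enumerate(labels, start=1) if label == "Scanned"]


# Hypothetical result for a document with a digital cover page and two scanned pages.
labels = ["Digitally created", "Scanned", "Scanned"]
print(summarize(labels))
print(pages_needing_ocr(labels))
```

<p>With a real file, the list could be built as <code>[page_type(page) for page in doc]</code>.</p>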