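I have the code below, which works for most types of images. For some reason, though, it does not work for TIFF images that contain only one page, or for PDFs.
I get this error: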
Traceback (most recent call last):
  File "/Users/fatiatravaille/Downloads/ocr_json/test.py", line 8, in <module>
    image = Image.open(r'./radio_lomb_300.tiff')
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/PIL/Image.py", line 3023, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file './radio_lomb_300.tiff'
import pytesseract
try:
    from PIL import Image
except ImportError:
    import Image
image = Image.open(r'./radio_lomb_300.tiff')
text=(pytesseract.image_to_string(image, lang='fra'))+'\n\n\n\n'
with open('text.test_ocr2','w') as fp: fp.write(text)
text=(pytesseract.image_to_boxes(image, lang='fra'))
with open('boundingBoxes.test_ocr2','w') as fp: fp.write(text)
text=(pytesseract.image_to_data(image, lang='fra'))
with open('data.test_ocr2','w') as fp: fp.write(text)
text=(pytesseract.image_to_osd(image))
with open('osd.test_ocr2','w') as fp: fp.write(text)
pdf = pytesseract.image_to_pdf_or_hocr(image, extension='pdf', lang='fra')
with open('test_ocr2.pdf', 'w+b') as f: f.write(pdf)
hocr = pytesseract.image_to_pdf_or_hocr(image, extension='hocr', lang='fra')
with open('test_ocr2.xml', 'w+b') as f: f.write(hocr)
alto = pytesseract.image_to_alto_xml(image)
with open('test_ocr_alto2.xml', 'w+b') as f: f.write(alto)
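Pillow raises UnidentifiedImageError when none of its format plugins recognizes the file's header, and it cannot read PDFs at all (its PDF support is write-only), so a PDF has to be rasterized to images before OCR. For the TIFF, one possible first step is to check what the file really contains by looking at its first four bytes. The sketch below uses only the standard library; `SIGNATURES`, `sniff_bytes` and `sniff_file` are hypothetical names chosen for illustration, and the table covers only the formats mentioned in the question.

```python
# Magic numbers for the formats discussed in the question.
SIGNATURES = {
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"II+\x00": "BigTIFF (little-endian)",
    b"MM\x00+": "BigTIFF (big-endian)",
    b"%PDF": "PDF",
}


def sniff_bytes(header: bytes) -> str:
    """Map the first four bytes of a file to a format name, or 'unknown'."""
    return SIGNATURES.get(header[:4], "unknown")


def sniff_file(path: str) -> str:
    """Read only the first four bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return sniff_bytes(f.read(4))


# Example call on the file from the question:
# print(sniff_file("./radio_lomb_300.tiff"))
```

If this reports "PDF" or "unknown" for a file with a .tiff extension, the file is not what its name claims. If it reports a TIFF variant, the header is fine and the cause more likely lies in something Pillow cannot decode inside the file, which this check does not diagnose.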
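Have you tried using it together with opencv? When I use opencv I do get a result, though I'm not sure how accurate the output is. You can check the page segmentation methods to improve the quality.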
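The code from this answer did not survive, so the following is only a sketch of what the opencv suggestion could look like, not the answerer's actual code. It assumes opencv-python, pytesseract and the French Tesseract language data are installed; `psm_config` and `ocr_with_opencv` are hypothetical helper names. The page segmentation mode is passed through Tesseract's `--psm` option, which accepts values from 0 to 13.

```python
def psm_config(psm: int) -> str:
    """Build a Tesseract config string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def ocr_with_opencv(path: str, lang: str = "fra", psm: int = 3) -> str:
    """Read an image with OpenCV and run Tesseract on it (assumed workflow)."""
    # Imported here so psm_config stays usable without OpenCV or Tesseract.
    import cv2
    import pytesseract

    img = cv2.imread(path)  # returns None instead of raising on failure
    if img is None:
        raise ValueError(f"OpenCV could not read {path!r}")
    # OpenCV loads pixels as BGR; pytesseract expects RGB arrays.
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(rgb, lang=lang, config=psm_config(psm))


# Example call on the file from the question, treating the page as a
# single uniform block of text (PSM 6):
# print(ocr_with_opencv("./radio_lomb_300.tiff", psm=6))
```

Mode 3 (fully automatic segmentation) is Tesseract's default. Trying other modes, such as 6 for a single block of text, can help when the default layout analysis splits the page badly.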