Preserving indentation with Tesseract OCR 4.x

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)

# saving to a txt file
with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)

1 Answer

User

#1 · Posted 2024-09-28 05:27:44

The image_to_data() function provides more information. For each word it returns the bounding rectangle, which you can use to reconstruct the layout.
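To make the shape of that result concrete, here is a minimal sketch that works on a hand-written dictionary with the same keys that image_to_data() returns when output_type=Output.DICT is used. The values are invented for illustration (they are not real OCR output), and word_boxes is a hypothetical helper name, not part of pytesseract.

```python
# Hand-written sample in the shape returned by
# pytesseract.image_to_data(..., output_type=Output.DICT).
# All values are made up for illustration; they are not real OCR output.
sample = {
    'level':     [4, 5, 5],      # 4 = line row, 5 = word row
    'page_num':  [1, 1, 1],
    'block_num': [1, 1, 1],
    'par_num':   [1, 1, 1],
    'line_num':  [1, 1, 1],
    'word_num':  [0, 1, 2],
    'left':      [40, 40, 210],
    'top':       [30, 30, 31],
    'width':     [250, 80, 80],
    'height':    [22, 22, 20],
    'conf':      [-1, 96, 93],
    'text':      ['', 'Name', 'Rossi'],
}


def word_boxes(data):
    """Return (text, left, top, width, height) for every recognised word.

    Hypothetical helper: rows without text or with a negative confidence
    (page, block, paragraph and line rows) are skipped.
    """
    boxes = []
    for i, text in enumerate(data['text']):
        if float(data['conf'][i]) < 0 or not text.strip():
            continue
        boxes.append((text, data['left'][i], data['top'][i],
                      data['width'][i], data['height'][i]))
    return boxes


for box in word_boxes(sample):
    print(box)
```

The left and width values are what the indentation logic needs: left says where a word starts in pixels, and width divided by the word length gives a rough character width.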

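Tesseract automatically segments the image into blocks. You can sort the blocks by their vertical position and, for each block, compute the average character width (which depends on the font recognized in that block). Then, for each word in the block, check whether it sits close to the previous word; if it does not, add spaces accordingly. I use pandas to simplify the calculations, but it is not required. Don't forget that the result should be displayed in a monospaced font.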

import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

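# clean up blanks; conf may be str, int or float depending on the pytesseract
# version, so compare it as a number rather than as the string '-1'
df1 = df[(pd.to_numeric(df.conf, errors='coerce') != -1) & (df.text.str.strip() != '')]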
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added 
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

This code produces the following output:

[output block missing in the source]
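The spacing step can also be tried in isolation, without pandas or an image. The sketch below is a simplified variant of the loop above, not the answer's exact logic: it pads each line to the column implied by a word's left edge and always keeps at least one space between words. The function names and the sample word boxes are hypothetical.

```python
def estimate_char_width(words):
    """Mean pixel width per character, preferring words longer than 3 characters."""
    ratios = [w['width'] / len(w['text']) for w in words if len(w['text']) > 3]
    if not ratios:
        # Fall back to every word when the line only has short ones.
        ratios = [w['width'] / len(w['text']) for w in words]
    return sum(ratios) / len(ratios)


def render_line(words, char_w):
    """Place each word at the text column implied by its left pixel offset."""
    line = ''
    for w in sorted(words, key=lambda w: w['left']):
        target_col = int(w['left'] / char_w)
        if line and target_col <= len(line):
            # Words that would collide still get one separating space.
            target_col = len(line) + 1
        line = line.ljust(target_col) + w['text']
    return line


# Made-up word boxes for a single line (pixels), for illustration only.
words = [
    {'left': 40, 'width': 40, 'text': 'Name'},
    {'left': 200, 'width': 50, 'text': 'Rossi'},
]
char_w = estimate_char_width(words)
print(repr(render_line(words, char_w)))
```

Because columns are derived from pixel offsets divided by an average character width, the alignment is only approximate for proportional fonts, which is why the result should be viewed in a monospaced font.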
