How do I extract tables from a PDF as text using Python?


1 Answer

Anonymous, 1 day ago


This answer is for anyone who runs into PDFs that contain images and therefore needs OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.
<ol>
<li>Use <code>pdfimages</code> from <a href="https://poppler.freedesktop.org/" rel="noreferrer">https://poppler.freedesktop.org/</a> to turn the pages of the PDF into images.</li>
<li>Use <a href="https://tesseract-ocr.github.io/" rel="noreferrer">Tesseract</a> to detect rotation and <a href="https://www.imagemagick.org/script/mogrify.php" rel="noreferrer">ImageMagick</a> <code>mogrify</code> to fix it.</li>
<li>Use OpenCV to find and extract tables.</li>
<li>Use OpenCV to find and extract each cell from the table.</li>
<li>Use OpenCV to crop and clean up each cell so that there is no noise that will confuse the OCR software.</li>
<li>Use Tesseract to OCR each cell.</li>
<li>Combine the extracted text of each cell into the format you need.</li>
</ol>
I wrote a Python package with modules that can help with those steps.

Repo: <a href="https://github.com/eihli/image-table-ocr" rel="noreferrer">https://github.com/eihli/image-table-ocr</a>

Docs and source: <a href="https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html" rel="noreferrer">https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html</a>

Some of the steps don't require code; they take advantage of external tools like <code>pdfimages</code> and <code>tesseract</code>. I'll provide some brief examples for a couple of the steps that do require code.
<ol start="2">
<li>Finding tables:</li>
</ol>
This link was a good reference while figuring out how to find tables: <a href="https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/" rel="noreferrer">https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/</a>
<pre><code>import cv2

def find_tables(image):
    # Expects a single-channel (grayscale) uint8 image.
    BLUR_KERNEL_SIZE = (17, 17)
    STD_DEV_X_DIRECTION = 0
    STD_DEV_Y_DIRECTION = 0
    # Pass the sigmas by keyword: the fourth positional argument of
    # cv2.GaussianBlur is the output array, not sigmaY.
    blurred = cv2.GaussianBlur(
        image,
        BLUR_KERNEL_SIZE,
        sigmaX=STD_DEV_X_DIRECTION,
        sigmaY=STD_DEV_Y_DIRECTION,
    )
    MAX_COLOR_VAL = 255
    BLOCK_SIZE = 15
    SUBTRACT_FROM_MEAN = -2

    # Invert the image so dark ruling lines become white, then binarize.
    img_bin = cv2.adaptiveThreshold(
        ~blurred,
        MAX_COLOR_VAL,
        cv2.ADAPTIVE_THRESH_MEAN_C,
        cv2.THRESH_BINARY,
        BLOCK_SIZE,
        SUBTRACT_FROM_MEAN,
    )
    SCALE = 5
    # NumPy shape is (rows, columns), i.e. (height, width).
    image_height, image_width = img_bin.shape

    # Morphological opening with long, thin kernels keeps only the
    # horizontal and vertical lines.
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
    horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
    vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)

    # Dilate so that lines with small gaps join into a closed grid.
    horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
    vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))

    mask = horizontally_dilated + vertically_dilated
    # OpenCV 4.x returns (contours, hierarchy); OpenCV 3.x returns three values.
    contours, hierarchy = cv2.findContours(
        mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
    )

    MIN_TABLE_AREA = 1e5
    contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
    perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
    epsilons = [0.1 * p for p in perimeter_lengths]
    approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
    bounding_rects = [cv2.boundingRect(a) for a in approx_polys]

    # The link where a lot of this code was borrowed from recommends an
    # additional step to check the number of "joints" inside this bounding rectangle.
    # A table should have a lot of intersections. We might have a rectangular image
    # here though which would only have 4 intersections, 1 at each corner.
    # Leaving that step as a future TODO if it is ever necessary.
    images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
    return images
</code></pre>
<ol start="3">
<li>Extracting cells from a table:</li>
</ol>
This is very similar to 2, so I won't include all the code. The part I will reference is sorting the cells.

We want to identify the cells from left to right, top to bottom.

We'll find the rectangle with the top-left-most corner. Then we'll find all of the rectangles whose center lies between the top y and bottom y values of that top-left rectangle. Then we'll sort those rectangles by their x value. We'll remove those rectangles from the list and repeat.
<pre><code># `cells` is a list of (x, y, w, h) bounding rectangles, one per cell,
# produced by the cell-finding code that is omitted here.

def cell_in_same_row(c1, c2):
    # True if the vertical center of c1 lies between the top and bottom of c2.
    c1_center = c1[1] + c1[3] / 2
    c2_bottom = c2[1] + c2[3]
    c2_top = c2[1]
    return c2_top < c1_center < c2_bottom

rows = []
while cells:
    first = cells[0]
    rest = cells[1:]
    cells_in_same_row = [c for c in rest if cell_in_same_row(c, first)]

    # Order the cells of this row from left to right.
    row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
    rows.append(row_cells)
    cells = [c for c in rest if not cell_in_same_row(c, first)]

# Sort rows by average height of their center.
def avg_height_of_center(row):
    centers = [y + h / 2 for x, y, w, h in row]
    return sum(centers) / len(centers)

rows.sort(key=avg_height_of_center)
</code></pre>
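The rotation step in the list above (detect with Tesseract, fix with <code>mogrify</code>) is not shown in code, so here is a minimal sketch of how it might be scripted. It assumes that <code>tesseract</code> (with its orientation and script detection data installed) and <code>mogrify</code> are on the PATH, and that Tesseract's <code>--psm 0</code> mode prints a line of the form <code>Rotate: 90</code>. The function names <code>parse_rotation</code> and <code>fix_rotation</code> are made up for this illustration and are not taken from the package.

```python
import re
import subprocess


def parse_rotation(osd_output):
    """Return the rotation in degrees found in Tesseract OSD text, or 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def fix_rotation(image_path):
    """Detect the rotation of an image with Tesseract and correct it in place.

    Assumption: `tesseract <image> stdout --psm 0` reports the correction
    angle on a "Rotate:" line, and `mogrify -rotate <angle>` applies it.
    """
    osd = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    degrees = parse_rotation(osd)
    if degrees:
        # mogrify overwrites the file, so work on a copy if the original matters.
        subprocess.run(["mogrify", "-rotate", str(degrees), image_path], check=True)
    return degrees
```

Keeping the parsing separate from the subprocess calls makes it possible to check the parsing logic without either tool installed.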
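The last step in the list, combining the text of each cell into the desired format, is also left without code. As a rough sketch, assuming the target format is CSV and that the OCR results are available as a dictionary keyed by each cell's <code>(x, y, w, h)</code> rectangle (both of these are assumptions for the example, not something the answer specifies), the sorted <code>rows</code> could be written out like this. <code>rows_to_csv</code> and <code>ocr_text</code> are hypothetical names.

```python
import csv
import io


def rows_to_csv(rows, ocr_text):
    """Join OCR results into CSV text, one line per table row.

    rows: list of rows, each a list of (x, y, w, h) cell rectangles,
          already sorted top to bottom and left to right.
    ocr_text: dict mapping each cell rectangle to its recognized text
              (hypothetical; filled in by whatever OCR step is used).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Cells with no OCR result become empty fields; OCR output often
        # carries stray whitespace and newlines, so strip it.
        writer.writerow([ocr_text.get(cell, "").strip() for cell in row])
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        [(0, 0, 10, 10), (10, 0, 10, 10)],
        [(0, 10, 10, 10), (10, 10, 10, 10)],
    ]
    ocr_text = {
        (0, 0, 10, 10): "Name",
        (10, 0, 10, 10): " Qty\n",
        (0, 10, 10, 10): "Bolt, M4",
        (10, 10, 10, 10): "12",
    }
    print(rows_to_csv(rows, ocr_text), end="")
```

Using the <code>csv</code> module instead of joining with commas means that cell text which itself contains a comma or a quote is escaped correctly.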
