Document Extraction
Detailed description of the DoT-Net Python project
Purpose:
Converts unstructured OCR documents into structured key-value pairs.
Required packages:
- Wand
- Pytesseract
- Tesseract
- Ghostscript
- ImageMagick
- OpenCV
- scikit-learn
- Keras
- TensorFlow
Usage
Replace the absolute path of the PDF in the main function of GETO2.0.py.
Key techniques used:
- Deep learning
- Ensemble learning
Machine learning architecture description
- DoT-Net: DoT-Net is a novel CNN architecture that classifies and segments the text elements in a document.
- RFClassifier: RFClassifier is an ensemble deep learning architecture used to detect TOC (table of contents) pages within a document.
Framework structure flowchart
The code is organized as follows:
- GETO2.0.py is the interface to our framework.
- Segmentation.py is the module for DoT-Net. This function is used in GETO2.0.py.
- TOCclassifier.py is the module that detects the TOC in the document. This function is used in GETO2.0.py.
- TESSARACT.py extracts the TOC entries from the TOC pages detected by TOCclassifier.py. This function is used in TOCclassifier.py.
- BlockParsing.py extracts the text entities from the blocks of text detected in Segmentation.py. This function is used in Segmentation.py.
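Code flow:
Detailed code description:
GETO2.0.py:
GETO2.0 is the main interface to our framework. Each page of the input PDF file is converted to an image using the Wand library. Each converted image is then checked for a TOC by the TOC classifier (we only check the first N pages for a TOC).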
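The page routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in GETO2.0.py: `route_pages`, `pdf_to_page_images`, the `is_toc` callable, and the 200 dpi resolution are hypothetical names and values chosen for the example. Only the use of Wand and the first-N rule come from the description.

```python
from typing import Any, Callable, List, Sequence, Tuple


def pdf_to_page_images(pdf_path: str, resolution: int = 200) -> List[Any]:
    """Render every page of a PDF as a separate Wand image.

    Needs Wand, ImageMagick, and Ghostscript to be installed. The import is
    done lazily so the routing helper below can be used without them.
    The resolution value is an assumption, not taken from the project.
    """
    from wand.image import Image

    with Image(filename=pdf_path, resolution=resolution) as pdf:
        # Clone each frame so it stays valid after the PDF is closed.
        return [Image(image=frame) for frame in pdf.sequence]


def route_pages(
    pages: Sequence[Any],
    is_toc: Callable[[Any], bool],
    first_n: int,
) -> Tuple[List[Any], List[Any]]:
    """Split pages into (TOC pages, non-TOC pages).

    Only the first `first_n` pages are passed to the classifier; every
    later page is treated as non-TOC without being checked.
    """
    toc_pages: List[Any] = []
    other_pages: List[Any] = []
    for index, page in enumerate(pages):
        if index < first_n and is_toc(page):
            toc_pages.append(page)
        else:
            other_pages.append(page)
    return toc_pages, other_pages
```

In the real pipeline `is_toc` would wrap the TOC classifier, and the two returned lists would feed the TOC branch and the segmentation branch respectively.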
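- [x] Page detected as TOC.
- TOCclassifier.py: TOCclassifier checks the page for a TOC. If the page is classified as TOC, we use TESSARACT.py to extract the TOC's text information and append it to a list.
- TESSARACT.py: TESSARACT.py uses pytesseract (a Python wrapper for Tesseract, a framework for extracting text from images) to extract the text from the TOC.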
- [x] Page detected as non-TOC.
- Note: Pages after the first N are also treated as non-TOC.
- Segmentation.py: Segmentation performs multiple tasks.
- It segments the pages using image morphology methods and contour functions to find the Connected Components (blocks).
- A sliding window is passed over these Connected Components to generate tiles of size 100 x 100 (DoT-Net takes 100 x 100 tiles as input to classify).
- Data duplication or augmentation is performed on blocks smaller than 100 x 100 (heading blocks in particular are often smaller than 100 x 100), to avoid losing data.
- These 100 x 100 tiles are then classified using DoT-Net.
- After patch classification, we use majority voting to predict the label of the block.
- If the block label is text, we use BlockParsing.py to extract the text from the block.
- Note: DoT-Net can also detect other classes such as Table, Image, Mathematical Expression, and Line Drawing, but this project focuses only on Text.
- BlockParsing.py uses pytesseract to extract the text.
- The extracted text is appended to a list.
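The tiling, duplication, and voting steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's Segmentation.py: the cyclic row/column repetition used for small blocks, the stride of 100, the handling of the last window, and all function names are hypothetical, and `classify_tile` stands in for DoT-Net.

```python
from collections import Counter
from typing import Callable, List, Sequence

TILE = 100  # DoT-Net input size stated in the text

Block = List[List[int]]  # a block as a 2D grid of pixel values


def pad_by_duplication(block: Block, size: int = TILE) -> Block:
    """Repeat rows and columns cyclically until the block is at least
    size x size. Blocks that are already large enough are returned
    with the same content."""
    height, width = len(block), len(block[0])
    out_h, out_w = max(height, size), max(width, size)
    return [
        [block[r % height][c % width] for c in range(out_w)]
        for r in range(out_h)
    ]


def window_starts(length: int, size: int, stride: int) -> List[int]:
    """Start offsets of the sliding window along one axis. The last
    window is shifted back so that it ends exactly at the edge."""
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)
    return starts


def sliding_tiles(block: Block, size: int = TILE, stride: int = TILE) -> List[Block]:
    """Cut a block into size x size tiles, padding small blocks first."""
    block = pad_by_duplication(block, size)
    rows = window_starts(len(block), size, stride)
    cols = window_starts(len(block[0]), size, stride)
    return [
        [row[c:c + size] for row in block[r:r + size]]
        for r in rows
        for c in cols
    ]


def majority_label(tile_labels: Sequence[str]) -> str:
    """Return the most frequent tile label."""
    return Counter(tile_labels).most_common(1)[0][0]


def classify_block(block: Block, classify_tile: Callable[[Block], str]) -> str:
    """Label a block by majority vote over its tile labels."""
    return majority_label([classify_tile(t) for t in sliding_tiles(block)])
```

A block whose voted label is text would then be passed on for text extraction. Blocks with any other label are ignored in this project.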
- [x] Text from the TOC and from the rest of the PDF document is extracted and appended to the respective lists.
- Once the text from the TOC and from the rest of the PDF document has been extracted and appended to the lists, we use fuzzy matching and regular expression matching to create JSON files.
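One way the final matching step could work is sketched below. It is an illustration under several assumptions, not the project's implementation: the shape of a TOC line (optional section number, title, dot leaders, page number), the output schema (TOC title mapped to the text that follows it), the 0.8 similarity cutoff, and the function names are all hypothetical. The standard library's `difflib` is used here as a stand-in for whichever fuzzy matching library the project uses.

```python
import difflib
import json
import re
from typing import Dict, List

# Assumed TOC line shape: "2.1 Scope of Work ..... 12"
TOC_LINE = re.compile(
    r"^\s*(?P<number>\d+(?:\.\d+)*)?\s*(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$"
)


def parse_toc(toc_lines: List[str]) -> List[str]:
    """Extract the section titles from OCR'd TOC lines with a regex."""
    titles = []
    for line in toc_lines:
        match = TOC_LINE.match(line)
        if match:
            titles.append(match.group("title").strip())
    return titles


def build_key_values(
    toc_titles: List[str], blocks: List[str], cutoff: float = 0.8
) -> Dict[str, str]:
    """Map each TOC title to the body text that follows its heading.

    A block that fuzzily matches a TOC title (tolerating OCR errors)
    starts a new key; any other block is appended to the current key.
    Blocks seen before the first matched heading are skipped.
    """
    result: Dict[str, str] = {}
    current = None
    for block in blocks:
        text = block.strip()
        hit = difflib.get_close_matches(text, toc_titles, n=1, cutoff=cutoff)
        if hit:
            current = hit[0]
            result.setdefault(current, "")
        elif current is not None:
            result[current] = (result[current] + " " + text).strip()
    return result


if __name__ == "__main__":
    toc = ["1 Introduction ..... 3", "2 Scope of Work ..... 7"]
    body = ["Intrcduction", "This report covers X.",
            "Scope of Wcrk", "The work includes Y."]
    print(json.dumps(build_key_values(parse_toc(toc), body), indent=2))
```

Using the TOC titles as the reference vocabulary means that a misread heading in the body text is still mapped to a clean key taken from the TOC.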