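A library that provides useful operations on PDFs and images
Detailed description of the pdfutil Python project
PDFUTIL [under development]
The library provides many operations on PDFs and images.
Input and output
The library exposes every function with the same standard set of parameters: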
import pdfutil
coordinates = pdfutil.detect_*(pdf_location, [save_result=False], [show_result=False], [result_location='.'], [args={}])
Name | Description |
---|---|
pdf_location | Input location of the PDF. An image can also be passed; the library autodetects it. |
save_result | Default False. If True, saves the result PDF/image to the location specified by result_location. |
show_result | Default False. For debugging only: when True, pops up a matplotlib plot highlighting the detected regions with their corresponding labels. |
result_location | Default is the current directory. Location where the output is saved; ignored if save_result is False. |
args | Custom set of arguments, given as a dictionary, specific to each function. |
coordinates | Output returned by the function call, containing JSON output in the following format: |
[
{
"type": "text",
"output": {
"coord": [
["pageno_1", "startx_1", "starty_1", "width_1", "height_1"],
["pageno_2", "startx_2", "starty_2", "width_2", "height_2"]
]
}
},
{
"type": "table",
"output": {
"coord": [
["pageno_1", "startx_1", "starty_1", "width_1", "height_1"]
]
}
}
]
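The format above can be consumed with a few lines of Python. The following is a minimal sketch, not part of pdfutil: `group_regions` is a hypothetical helper, the sample values are made up for illustration, and it assumes the result may arrive either as a JSON string or as an already-parsed list (the description above does not say which).

```python
import json
from collections import defaultdict


def group_regions(coordinates):
    """Group detected regions by their "type" field.

    Hypothetical helper, not part of pdfutil. It only relies on the
    output format documented above.
    """
    # Assumption: accept either a JSON string or an already-parsed list.
    if isinstance(coordinates, str):
        coordinates = json.loads(coordinates)

    grouped = defaultdict(list)
    for entry in coordinates:
        # Each coord row is [pageno, startx, starty, width, height].
        for page, x, y, width, height in entry["output"]["coord"]:
            grouped[entry["type"]].append(
                {"page": page, "x": x, "y": y, "width": width, "height": height}
            )
    return dict(grouped)


# Made-up sample data in the documented shape.
sample = [
    {"type": "text", "output": {"coord": [[1, 10, 20, 200, 50], [2, 15, 30, 180, 40]]}},
    {"type": "table", "output": {"coord": [[1, 10, 100, 300, 120]]}},
]

regions = group_regions(sample)
print(regions["table"][0]["width"])
```

Grouping by type makes it easy to handle tables and text regions separately once several kinds of regions appear in one result.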
Operations
Detect tables
import pdfutil
coordinates = pdfutil.detect_tables(pdf_location)
Detect text regions [paragraphs/unstructured content]
import pdfutil
coordinates = pdfutil.detect_text(pdf_location)
Detect non-text regions [images/logos]
import pdfutil
coordinates = pdfutil.detect_non_text(pdf_location)
Detect language
import pdfutil
coordinates = pdfutil.detect_non_language(pdf_location)
Detect key-value pairs
import pdfutil
coordinates = pdfutil.detect_key_value_pairs(pdf_location)
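Because every operation takes the same first argument, several detectors can be run over one file and their results merged. The sketch below is an illustration under assumptions: `run_detectors` is a hypothetical helper, it assumes each detector returns a list in the documented output format, and the two `fake_detect_*` functions are stand-ins so the example runs without pdfutil installed.

```python
def run_detectors(pdf_location, detectors):
    """Call each detector on the same file and concatenate the results.

    Hypothetical helper, not part of pdfutil. Assumes every detector
    returns a list of {"type": ..., "output": {"coord": [...]}} entries.
    """
    combined = []
    for detect in detectors:
        combined.extend(detect(pdf_location))
    return combined


# Stand-ins for illustration only. In real use one would pass functions
# such as pdfutil.detect_tables and pdfutil.detect_text instead.
def fake_detect_tables(pdf_location):
    return [{"type": "table", "output": {"coord": [[1, 10, 100, 300, 120]]}}]


def fake_detect_text(pdf_location):
    return [{"type": "text", "output": {"coord": [[1, 10, 20, 200, 50]]}}]


results = run_detectors("sample.pdf", [fake_detect_tables, fake_detect_text])
print([entry["type"] for entry in results])
```

Passing the detectors as a list keeps the helper independent of which operations the library ends up exposing.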