Python invoice2data package - PyPI

A Python parser that extracts data from PDF invoices

invoice2data project description

This project was selected for GSoC 2018. Read more here.

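A modular Python library to support your accounting process. Tested on Python 2.7 and 3.4+. Main steps:

Extract text from PDF files using different techniques, such as pdftotext, pdfminer, or OCR with tesseract, tesseract4, or gvision (Google Cloud Vision).
Search for regular expressions in the result using a YAML-based template system.
Save the results as CSV, JSON, or XML, or rename the PDF files to match their content.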
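
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not invoice2data's own implementation: the sample text stands in for the output of a PDF text extractor, and `FIELD_PATTERNS`, `extract_fields`, and `to_csv` are hypothetical names chosen for this example.

```python
import csv
import io
import re

# Stand-in for text produced by a PDF-to-text step (hypothetical sample).
SAMPLE_TEXT = """Invoice Number: 42183017
Invoice Date: 2014-08-03
TOTAL AMOUNT DUE: $4.11
"""

# Hypothetical field patterns; a real template would live in a YAML file.
FIELD_PATTERNS = {
    "invoice_number": r"Invoice Number:\s+(\d+)",
    "date": r"Invoice Date:\s+(\d{4}-\d{2}-\d{2})",
    "amount": r"TOTAL AMOUNT DUE:\s+\$(\d+\.\d+)",
}


def extract_fields(text, patterns):
    """Return the first captured group for each pattern that matches."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            result[name] = match.group(1)
    return result


def to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields = extract_fields(SAMPLE_TEXT, FIELD_PATTERNS)
csv_text = to_csv([fields], list(FIELD_PATTERNS))
print(csv_text)
```

Keeping extraction and serialization in separate functions mirrors the step structure described above, so either half can be swapped without touching the other.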

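The flexible template system lets you:

Precisely match the content of PDF files
Use plugins to match line items and tables
Define static fields that are the same for every invoice
Define custom fields needed in your organisation or process
Provide multiple regexes per field (in case the layout or wording changes)
Define the currency
Extract invoice items using the lines plugin developed by Holger Brunn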
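
As an illustration of the "multiple regexes per field" idea, the sketch below tries a list of patterns in order and keeps the first hit. It is an assumption about how such a fallback could work, not the library's code; `first_match` and the two invoice wordings are hypothetical.

```python
import re


def first_match(text, patterns):
    """Try each regex in order and return the first captured group, or None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


# Two hypothetical wordings a supplier might use for the same field.
INVOICE_NUMBER_PATTERNS = [
    r"Invoice Number:\s+(\d+)",
    r"Invoice No\.\s+(\d+)",
]

old_layout = "Invoice Number: 12429647"
new_layout = "Invoice No. 12429647"

print(first_match(old_layout, INVOICE_NUMBER_PATTERNS))
print(first_match(new_layout, INVOICE_NUMBER_PATTERNS))
```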

Go from PDF files to this:

{'date': (2014, 5, 7), 'invoice_number': '30064443', 'amount': 34.73, 'desc': 'Invoice 30064443 from QualityHosting', 'lines': [{'price': 42.0, 'desc': u'Small Business StandardExchange 2010\nGrundgeb\xfchr pro Einheit\nDienst: OUDJQ_office\n01.05.14-31.05.14\n', 'pos': u'7', 'qty': 1.0}]}
{'date': (2014, 6, 4), 'invoice_number': 'EUVINS1-OF5-DE-120725895', 'amount': 35.24, 'desc': 'Invoice EUVINS1-OF5-DE-120725895 from Amazon EU'}
{'date': (2014, 8, 3), 'invoice_number': '42183017', 'amount': 4.11, 'desc': 'Invoice 42183017 from Amazon Web Services'}
{'date': (2015, 1, 28), 'invoice_number': '12429647', 'amount': 101.0, 'desc': 'Invoice 12429647 from Envato'}

Installation

Install pdftotext

If possible, get the latest xpdf/poppler-utils version. It is included with macOS Homebrew, Debian, and Ubuntu. Without it, pdftotext won't parse tables in PDFs correctly.
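
Before installing the Python package, it can help to confirm that the pdftotext executable is reachable. The following is a small sketch using only the standard library; `find_pdftotext` is a hypothetical helper name, and the check says nothing about which version is installed.

```python
import shutil


def find_pdftotext():
    """Return the full path of the pdftotext executable, or None if it is not on PATH."""
    return shutil.which("pdftotext")


path = find_pdftotext()
if path is None:
    print("pdftotext not found; install xpdf or poppler-utils first")
else:
    print("pdftotext found at", path)
```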

Install invoice2data using pip.

pip install invoice2data

Usage

Basic usage: process PDF files and write the result to CSV.

Choose any of the following input readers:

pdftotext
tesseract
pdfminer
tesseract4
gvision (requires an environment variable to be set)

Choose any of the following output formats:

csv
json
xml

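Save the output file with a custom name or in a specific folder:

invoice2data --output-format csv --output-name myinvoices/invoices.csv invoice.pdf

Note: you must specify output-format in order to set output-name.

Specify a folder with yml templates (for example, one for your suppliers):

invoice2data --template-folder ACME-templates invoice.pdf

Use only your own templates and exclude the built-in ones:

invoice2data --exclude-built-in-templates --template-folder ACME-templates invoice.pdf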

Process a folder of invoices and copy the renamed invoices to a new folder:

invoice2data --copy new_folder folder_with_invoices/*.pdf
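
To make the copy-and-rename idea concrete, here is a hedged sketch that builds a file name from extracted fields and copies the PDF under that name. The naming scheme (`YYYY-MM-DD invoice_number.pdf`) and the helpers `build_name` and `copy_renamed` are assumptions for illustration; the tool's actual naming may differ. The date tuple follows the shape of the sample output shown earlier.

```python
import os
import shutil
import tempfile


def build_name(data):
    """Build a file name such as '2014-08-03 42183017.pdf' from extracted fields."""
    year, month, day = data["date"]
    return "{:04d}-{:02d}-{:02d} {}.pdf".format(year, month, day, data["invoice_number"])


def copy_renamed(source_path, data, target_dir):
    """Copy source_path into target_dir under a name derived from data."""
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, build_name(data))
    shutil.copyfile(source_path, target_path)
    return target_path


# Demonstration in a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "scan.pdf")
    with open(source, "wb") as handle:
        handle.write(b"%PDF-1.4 placeholder")
    data = {"date": (2014, 8, 3), "invoice_number": "42183017"}
    copied = copy_renamed(source, data, os.path.join(workdir, "new_folder"))
    copied_name = os.path.basename(copied)
    copied_exists = os.path.exists(copied)

print(copied_name)
```

Copying rather than moving leaves the original files untouched, which makes a trial run safe to repeat.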

Process a single file and dump the whole file for debugging (when adding new templates in templates.py):

invoice2data --debug my_invoice.pdf

Recognize the test invoices:

invoice2data invoice2data/test/pdfs/* --debug

If you want to use it as a library, just do:

from invoice2data import extract_data

result = extract_data('path/to/my/file.pdf')

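If you want to use your own templates, you can use:

from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

# Path to the invoice to process.
filename = 'path/to/my/file.pdf'

# Load custom templates from a folder, then extract with them.
templates = read_templates('/path/to/your/templates/')
result = extract_data(filename, templates=templates)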

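Template system

Existing templates ship with the package; just extend the list to add your own. If deployed by a larger organisation, there should be an interface for editing templates for new suppliers (the 80-20 rule). For a short tutorial on how to add new templates, see TUTORIAL.rst.

Templates are based on YAML. They define one or more keywords used to find the right template, and regexps for the fields to be extracted. A field can also be a static value, such as the full company name.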

Template files are tried in alphabetical order.
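
The selection process can be pictured roughly as follows. This sketch assumes a template matches when every one of its keywords occurs in the extracted text, which is an assumption made for illustration; `pick_template` and the file names are hypothetical.

```python
def pick_template(text, templates):
    """Return the first template name, in alphabetical order, whose keywords all occur in text."""
    for name in sorted(templates):
        keywords = templates[name]["keywords"]
        if all(keyword in text for keyword in keywords):
            return name
    return None


# Hypothetical template files reduced to their keywords.
TEMPLATES = {
    "com.envato.yml": {"keywords": ["Envato"]},
    "com.amazon.aws.yml": {"keywords": ["Amazon Web Services"]},
}

print(pick_template("Invoice from Amazon Web Services, Inc.", TEMPLATES))
```

Because the first match wins, a template with broad keywords that sorts early can shadow a more specific one that sorts later.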

We may extend them with options to be used during invoice processing.

Example:

issuer: Amazon Web Services, Inc.
keywords:
- Amazon Web Services
fields:
  amount: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  amount_untaxed: TOTAL AMOUNT DUE ON.*\$(\d+\.\d+)
  date: Invoice Date:\s+([a-zA-Z]+ \d+ , \d+)
  invoice_number: Invoice Number:\s+(\d+)
  partner_name: (Amazon Web Services, Inc\.)
options:
  remove_whitespace: false
  currency: HKD
  date_formats:
    - '%d/%m/%Y'
lines:
    start: Detail
    end: \* May include estimated US sales tax
    first_line: ^    (?P<description>\w+.*)\$(?P<price_unit>\d+\.\d+)
    line: (.*)\$(\d+\.\d+)
    last_line: VAT \*\*
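
To show how the `start`, `end`, and `line` rules in the example could work together, the sketch below cuts out the block between the start and end markers and applies the line regex to each row. It is a simplified reading of the example, not the lines plugin itself: `first_line` and `last_line` are ignored, and `extract_lines` and the sample text are hypothetical.

```python
import re

# Rules copied from the `lines` section of the example template.
LINES_RULES = {
    "start": r"Detail",
    "end": r"\* May include estimated US sales tax",
    "line": r"(.*)\$(\d+\.\d+)",
}

# Made-up invoice text for demonstration.
SAMPLE_TEXT = """Summary
Detail
    Compute hours   $3.10
    Storage         $1.01
* May include estimated US sales tax
Footer $9.99
"""


def extract_lines(text, rules):
    """Return line items found between the start and end markers."""
    start = re.search(rules["start"], text)
    if start is None:
        return []
    end = re.search(rules["end"], text[start.end():])
    if end is None:
        return []
    body = text[start.end():start.end() + end.start()]
    items = []
    for raw in body.splitlines():
        match = re.search(rules["line"], raw)
        if match:
            items.append({"desc": match.group(1).strip(), "price": float(match.group(2))})
    return items


items = extract_lines(SAMPLE_TEXT, LINES_RULES)
print(items)
```

Restricting the search to the block between the markers keeps amounts elsewhere on the page, such as the footer line in the sample, out of the item list.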

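Development

If you are interested in improving this project, have a look at our developer guide to get started quickly.

Roadmap and open tasks

Integrate with online OCR?
Try to "guess" parameters for new invoice formats.
Can machine learning be applied to guess new parameters?

Maintainers

Contributors

Harshit Joshi: as Google Summer of Code student.
Holger Brunn: added support for parsing invoice items.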

