Extract text from PDF documents easily.
slate: Python project description
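slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
slate provides one class, PDF. PDF takes a file-like object and extracts all text from the document, presenting each page as a string of text: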
>>> import slate
>>> with open('example.pdf', 'rb') as f:  # PDFs are binary files
...     doc = slate.PDF(f)
...
>>> doc
[..., ..., ...]
>>> doc[1]
'Text from page 2...'
If your PDF is password protected, pass the password as the second argument:
>>> import slate
>>> with open('secrets.pdf', 'rb') as f:  # PDFs are binary files
...     doc = slate.PDF(f, 'password')
...
>>> doc[0]
"My mother doesn't know this, but..."
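Because the extracted document behaves like a list of page strings, ordinary list and string operations are enough for most follow-up work. The sketch below is not part of slate: `join_pages` and `pages_containing` are hypothetical helper names, and the `pages` list is a stand-in for a real `slate.PDF` object so that the example runs without a PDF file.

```python
def join_pages(pages, separator="\n\n"):
    """Concatenate all page strings into one text, separated by blank lines."""
    return separator.join(pages)


def pages_containing(pages, term):
    """Return the zero-based indexes of pages whose text contains term.

    The comparison ignores case.
    """
    needle = term.lower()
    return [index for index, text in enumerate(pages) if needle in text.lower()]


# Stand-in for a slate.PDF object (assumption: it acts like a list of strings).
pages = ["Intro to slate", "Text from page 2...", "Closing notes on PDF text"]

full_text = join_pages(pages)
matches = pages_containing(pages, "text")
print(matches)
```

With a real document, passing `doc` in place of `pages` should work the same way, provided `doc` supports iteration as the list-style output above suggests.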
More complex operations
If you want to access images, font files, and other information, take some time to learn the PDFMiner API.
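As a starting point for that, here is a rough sketch of per-page text extraction written directly against PDFMiner. It assumes the maintained pdfminer.six fork and its module layout (`pdfminer.converter`, `pdfminer.pdfinterp`, `pdfminer.pdfpage`). The function name `extract_page_texts` is hypothetical, and this is not necessarily how slate does it internally.

```python
import io


def extract_page_texts(fp, password=""):
    """Return a list with one text string per page of the PDF open at fp.

    fp must be a file object opened in binary mode.
    """
    # Imported lazily so this module loads even without pdfminer.six installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    pages = []
    for page in PDFPage.get_pages(fp, password=password):
        # A fresh buffer and converter per page keeps each page's text separate.
        buffer = io.StringIO()
        device = TextConverter(resource_manager, buffer, laparams=LAParams())
        PDFPageInterpreter(resource_manager, device).process_page(page)
        device.close()
        pages.append(buffer.getvalue())
    return pages
```

Calling `extract_page_texts(open('example.pdf', 'rb'))` would be the rough equivalent of `slate.PDF(f)`. Images and fonts need other parts of the PDFMiner API that this sketch does not touch.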
What is wrong with PDFMiner?
- Getting simple things done, like extracting the text, is quite complex. The program is not designed to return Python objects, which makes interfacing with it irritating.
- It’s an extremely complete set of tools, with multiple and moderately steep learning curves.
- It’s not written with hackability in mind.