Java: automatically extracting text from a large number of PDF files


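I have about 10,000 PDF files (conference papers), and I need to extract the text of certain parts of each one (such as the experiments section) and save it all in a single file. Does anyone know of a Java or Python tool that can help me do this?

Thanks in advance,

Ayush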

# Answer 1

Since these are academic papers, you should also take a look at LA-PDFText:

LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance where needed). The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize.
# Answer 2

Did you research your question before posting? A quick Google search turned up this Apache project: http://pdfbox.apache.org/
# Answer 3

For Java: take a look at iText.

For Python, I would use PDFMiner.
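To make the Python route concrete, here is a minimal sketch of the whole job: walk a folder of PDFs, pull out the text, cut out the experiments section, and append it to one output file. It assumes the `pdfminer.six` package (the maintained fork of PDFMiner) and its `pdfminer.high_level.extract_text` function. The heading patterns and the names `extract_section` and `extract_all` are my own illustrative choices, not part of any library, and real papers will probably need looser or per-venue rules.

```python
import re
from pathlib import Path

# Assumed heading patterns (hypothetical): an optional section number followed
# by a typical title, alone on its line. Adjust these for your own corpus.
SECTION_START = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:Experiments?|Experimental (?:Results|Setup|Evaluation))[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
NEXT_HEADING = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?"
    r"(?:Conclusions?|Discussion|Related Work|References|Acknowledge?ments?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_section(text):
    """Return the text between the experiments heading and the next known
    heading, or None if no experiments heading is found."""
    start = SECTION_START.search(text)
    if start is None:
        return None
    end = NEXT_HEADING.search(text, start.end())
    stop = end.start() if end else len(text)
    return text[start.end():stop].strip()


def extract_all(pdf_dir, out_path):
    """Write the experiments section of every PDF in pdf_dir to out_path.
    Returns the number of files that yielded a section."""
    # Imported here so extract_section can be used without pdfminer.six.
    from pdfminer.high_level import extract_text  # pip install pdfminer.six

    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            try:
                text = extract_text(str(pdf))
            except Exception as exc:  # damaged or encrypted files are common
                print(f"skipping {pdf.name}: {exc}")
                continue
            section = extract_section(text)
            if section:
                out.write(f"=== {pdf.name} ===\n{section}\n\n")
                count += 1
    return count
```

You would call it as `extract_all("papers/", "experiments.txt")`, where both paths are placeholders. With 10,000 files it is worth spot-checking the files that yield no section, since two-column layouts often scramble heading lines during extraction.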
