LA-PDFText is a system for extracting accurate text from PDF-based
research articles, together with an interface for improving performance
where needed. The system is open source and provides a simple
baseline function for extracting text from primary research articles
using rules that developers can customize.
# Answer 1
Since these are academic papers, you should also take a look at lapdftext.
# Answer 2
Did you research your problem before posting? I just found this Apache project on Google: http://pdfbox.apache.org/
# Answer 3
For Java: take a look at iText.
For Python, I would use PDFMiner.
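
To give an idea of the Python route, here is a minimal sketch that assumes the maintained `pdfminer.six` fork and its `pdfminer.high_level.extract_text` function. The `extract_pages` helper and its `extractor` parameter are my own names for illustration, not part of PDFMiner, and the page splitting relies on the assumption that the extracted text separates pages with form-feed characters, so check that against your own files.

```python
from typing import Callable, List, Optional


def extract_pages(pdf_path: str,
                  extractor: Optional[Callable[[str], str]] = None) -> List[str]:
    """Return the text of each page of a PDF as a list of strings.

    `extractor` is a hypothetical hook (not part of PDFMiner) that lets
    the page-splitting logic be exercised without a real PDF.
    """
    if extractor is None:
        # Imported lazily so this module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text
        extractor = extract_text

    raw = extractor(pdf_path)
    # Assumption: pages are separated by form feeds in the extracted text.
    pages = [page.strip() for page in raw.split("\f")]
    # Drop the empty chunk left after the final form feed.
    return [page for page in pages if page]
```

With `pdfminer.six` installed, `extract_pages("paper.pdf")` (where `paper.pdf` is a placeholder path) should give one string per page. Multi-column research articles may still come out with the reading order scrambled, which is the kind of problem lapdftext is aimed at.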