How can I extract a table as text from a PDF using Python?

import PyPDF2

PDFfilename = "Sammamish.pdf"  # filename of your PDF/directory where your PDF is stored
pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb"))  # PdfFileReader object
pg4 = pfr.getPage(126)  # extract pg 127
writer = PyPDF2.PdfFileWriter()  # create PdfFileWriter object
# add pages
writer.addPage(pg4)
NewPDFfilename = "allTables.pdf"  # filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream)  # write pages to new PDF

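3 Answers

User

#1 · edited 2024-09-23 10:21:47

A 2019 update to this question, since every search for "python extract pdf table" sends me here.

There is a Python solution for this called Camelot, with Excalibur as its companion web interface: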

https://github.com/atlanhq/camelot
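
As a rough sketch of what the Camelot suggestion could look like in practice: this assumes the `camelot-py` package is installed and that the PDF is text-based rather than scanned. `camelot.read_pdf` returns a list of tables, each exposing a pandas DataFrame as `.df`. The helper names `rows_to_text` and `extract_tables_as_text` are made up for this illustration and are not part of Camelot.

```python
def rows_to_text(rows, sep="\t"):
    """Turn a table given as a list of rows into separator-delimited text."""
    return "\n".join(sep.join(str(cell) for cell in row) for row in rows)


def extract_tables_as_text(pdf_path, pages="1"):
    """Return one text block per table that Camelot finds on the given pages."""
    # Imported lazily so the pure-Python helper above works without Camelot.
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each table exposes its cells as a pandas DataFrame in `.df`.
    return [rows_to_text(table.df.values.tolist()) for table in tables]


if __name__ == "__main__":
    # "Sammamish.pdf" and page 127 are taken from the question.
    for text in extract_tables_as_text("Sammamish.pdf", pages="127"):
        print(text)
        print("-" * 40)
```

Camelot's default parsing mode looks for ruling lines between cells, so a table without visible borders may need `flavor="stream"` passed to `camelot.read_pdf` instead.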

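User

#2 · edited 2024-09-23 10:21:47

As I see it, you have four options:

You can use tabula.
You can use pdftotext to convert the PDF to text, then parse the text with Python.
You can use an external tool to convert the PDF to Excel or CSV, then open the Excel/CSV file with the appropriate Python module.
You can also convert the PDF to an image file, then use any recent OCR software (one that automatically reconstructs tables from images) to get the data.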
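
The second option could be sketched roughly as follows. This assumes the `pdftotext` command-line tool from Poppler is installed and on the PATH, and that columns in its `-layout` output are separated by at least two spaces, which is a heuristic that will not hold for every table. The function names `pdf_to_layout_text` and `parse_columns` are hypothetical.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext in layout-preserving mode and return its output as a string."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_columns(text):
    """Split each non-empty line into cells on runs of two or more whitespace characters."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


if __name__ == "__main__":
    # "Sammamish.pdf" is the file name used in the question.
    for row in parse_columns(pdf_to_layout_text("Sammamish.pdf")):
        print(row)
```

Cells that themselves contain double spaces, or columns that sit only one space apart, will be split wrongly, so real documents usually need some cleanup after this step.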

Your question is similar to the following:

Regards

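User

#3 · edited 2024-09-23 10:21:47

I suggest extracting the table with tabula. Pass your PDF as an argument to the tabula API and it will return the table as a DataFrame. Each table in the PDF is returned as its own DataFrame. Here is the code I use to extract from a PDF.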

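# tabula returns the tables as a list of DataFrames; working with them requires pandas
import os
import pandas as pd
import tabula

file = "filename.pdf"
# os.path.join adds the path separator that plain string concatenation would omit
path = os.path.join("enter your directory path here", file)
# with multiple_tables=True the result is a list with one DataFrame per table
tables = tabula.read_pdf(path, pages="1", multiple_tables=True)
print(tables)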

See my repo for more details.
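
Since the question asks for the tables as text, one possible follow-up is to turn each DataFrame that `tabula.read_pdf` returns into a tab-separated string. This is only a sketch: it assumes `tabula-py` (which needs Java) and pandas are installed, and the function names `frames_to_text` and `extract_tables` are invented for illustration.

```python
def frames_to_text(frames, sep="\t"):
    """Convert each DataFrame into delimited text, one string per table."""
    # DataFrame.to_csv returns a string when no output path is given.
    return [frame.to_csv(index=False, sep=sep) for frame in frames]


def extract_tables(pdf_path, pages="1"):
    """Return the tables on the given pages as a list of DataFrames."""
    # Imported lazily so frames_to_text can be used without tabula installed.
    import tabula  # third-party: pip install tabula-py

    return tabula.read_pdf(pdf_path, pages=pages, multiple_tables=True)


if __name__ == "__main__":
    # "filename.pdf" is the placeholder name used in the answer above.
    for text in frames_to_text(extract_tables("filename.pdf")):
        print(text)
```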
