Getting PDF attachments with Python

Posted 2024-09-28 01:24:58


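I admit I am new to Python. We have to process PDF files that contain attachments or annotated attachments. I am trying to extract the attachments from a PDF file with the PyPDF2 library.

The only (!) example on GitHub contains the following code: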

import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print(catalog)

    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']

The call is:

rootdir = "C:/Users/***.pdf"  # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)

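I get a KeyError: '/EmbeddedFiles'

The printed catalog indeed contains no embedded-files entry: {'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}

This particular PDF contains 9 attachments. How can I get them?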


2 Answers

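This is too long for a comment. I have not tested this code myself, and it looks very similar to the outline in your question, but I am adding it here for others to test. It is the subject of pull request https://github.com/mstamy2/PyPDF2/pull/440, and the full updated sequence is described by Kevin M Loeffler at https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/

It can be viewed at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py

Download it from https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py

It always helps if you can provide a sample input of the type you are having trouble with, so that others can adapt the extraction routine to suit you.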

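A response to the error you received: "I would guess the script is crashing because the embedded files section of the PDF does not always exist, so trying to access it throws an error." "I would try putting everything after the 'catalog' line in the get_attachments method in a try-catch."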
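One way to follow that advice without swallowing unrelated errors is to test for each key before indexing. The sketch below is untested against real PDFs and rests on assumptions: resolve, walk_name_tree and get_attachments are hypothetical helper names, the catalog is assumed to behave like a dict (PyPDF2's dictionary objects subclass dict), and the name tree is assumed to follow the usual PDF layout of a flat /Names array of name/value pairs, optionally split across /Kids.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers one (PyPDF2-style)."""
    for attr in ("get_object", "getObject"):
        getter = getattr(obj, attr, None)
        if callable(getter):
            return getter()
    return obj  # plain Python objects are returned unchanged


def walk_name_tree(node):
    """Yield (name, value) pairs from a PDF name tree node and its kids."""
    node = resolve(node)
    if "/Names" in node:
        pairs = resolve(node["/Names"])
        # The array alternates name, value, name, value, ...
        for i in range(0, len(pairs) - 1, 2):
            yield resolve(pairs[i]), resolve(pairs[i + 1])
    if "/Kids" in node:
        for kid in resolve(node["/Kids"]):
            yield from walk_name_tree(kid)


def get_attachments(catalog):
    """Return {name: filespec}, or {} when there is no /EmbeddedFiles tree."""
    catalog = resolve(catalog)
    if "/Names" not in catalog:
        return {}
    names = resolve(catalog["/Names"])
    if "/EmbeddedFiles" not in names:
        return {}
    return dict(walk_name_tree(names["/EmbeddedFiles"]))
```

With PyPDF2 the call would presumably be get_attachments(reader.trailer["/Root"]). An empty result means the document has no /EmbeddedFiles name tree, which is exactly the situation that raises the KeyError in the question. In the PDF format the bytes of each file sit in the /EF /F stream of its file specification.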

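Unfortunately, there are many pending pull requests that have not been merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls). Others may also be relevant, or may be needed to fix this and other shortcomings, so you will have to check whether any of them help.

For a pending example of a try-catch that you may be able to include and adapt to other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3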

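Besides /Type /EmbeddedFiles, the relevant keywords for embedded files include /Type /Filespec and /Subtype /FileAttachment. Note that these pairs are not always written with a space between them, so it is worth checking whether they can be queried for attachments.

Also on that last point: the example searches for /EmbeddedFiles, which is the plural index, whereas each individual entry is itself identified in the singular (/Type /EmbeddedFile).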
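Attachments that hang off page annotations are not listed in the catalog at all, so they have to be collected page by page. The following is a minimal sketch of that lookup, assuming each page and annotation behaves like a dict and that the file specification sits under the annotation's /FS key. The names resolve, iter_file_attachment_annots and filespec_name are hypothetical helpers, not PyPDF2 API, and I have not run this against the PDF in the question.

```python
def resolve(obj):
    """Follow an indirect reference if the object offers one (PyPDF2-style)."""
    for attr in ("get_object", "getObject"):
        getter = getattr(obj, attr, None)
        if callable(getter):
            return getter()
    return obj  # plain Python objects are returned unchanged


def iter_file_attachment_annots(pages):
    """Yield (page_number, filespec) for every /FileAttachment annotation."""
    for page_number, page in enumerate(pages, start=1):
        page = resolve(page)
        if "/Annots" not in page:
            continue  # page has no annotations
        for annot in resolve(page["/Annots"]):
            annot = resolve(annot)
            if "/Subtype" not in annot or "/FS" not in annot:
                continue
            if annot["/Subtype"] == "/FileAttachment":
                yield page_number, resolve(annot["/FS"])


def filespec_name(filespec):
    """Prefer the Unicode file name (/UF) and fall back to /F."""
    for key in ("/UF", "/F"):
        if key in filespec:
            return resolve(filespec[key])
    return None
```

With PyPDF2 one would presumably pass reader.pages as the pages argument. Each file specification that comes back has the same shape as the ones in the /EmbeddedFiles name tree, so the same code can read the file bytes from either source.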

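This can be improved, but it has been tested (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I have not yet compared the output with our internal classification.
It generates a semicolon-delimited file that can be imported into Excel.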

import fitz                      # = PyMuPDF
import os

outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"

print ("filepath;","encrypted;","pages;", "embedded;","attachments;","annotations;","portfolio",  file = outfile)
enc=pages=count=names=annots=collection=''

for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print (os.path.join(subdir, file))
        filepath = subdir + os.sep + file

        if filepath.endswith(".pdf"):
            #print (filepath, file = outfile)
            
            try:
                doc = fitz.open(filepath)
 
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file = outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file = outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file = outfile)     # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file = outfile) 
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file = outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file = outfile) 
                
                links = doc.has_links()
                #print("Has links?", links, file = outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file = outfile)
                xreflen = doc.xref_length()  # length of objects table
                for xref in range(1, xreflen):  # skip item 0!
                    #print("", file = outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file = outfile)
                    #print(doc.xref_object(xref, compressed=False), file = outfile)
                    
                    if "Collection" in doc.xref_object(xref, compressed=False): 
                        #print ("Portfolio", file = outfile)
                        collection ='True'
                        break
                    else: collection="False"
                    #print(doc.xref_object(xref, compressed=False), file = outfile)
                    
            except Exception:
                # Any failure to open or parse the file is reported as an invalid PDF
                enc=pages=count=names=annots=collection="Not a valid PDF"
            print(filepath,";", enc,";",pages, ";",count, ";",names, ";",annots, ";",collection, file = outfile )                
outfile.close()
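The print call above joins the fields by hand, so a file path or attachment name that itself contains a semicolon would shift the columns in Excel. As a possible refinement, here is a sketch that hands the same seven columns to the standard csv module, which quotes such fields. FIELDS and write_report are hypothetical names, and each row is assumed to be a dict holding the values the loop above collects.

```python
import csv
import io

# Column order matches the header printed by the script above.
FIELDS = ["filepath", "encrypted", "pages", "embedded",
          "attachments", "annotations", "portfolio"]


def write_report(rows, stream):
    """Write rows (dicts keyed by FIELDS) as semicolon-delimited text."""
    writer = csv.writer(stream, delimiter=";", lineterminator="\n")
    writer.writerow(FIELDS)
    for row in rows:
        record = dict(row)
        # The attachment names arrive as a list; flatten them into one cell.
        record["attachments"] = ", ".join(record.get("attachments", []))
        writer.writerow([record.get(field, "") for field in FIELDS])


# Example with an in-memory stream and a path that contains a semicolon.
buffer = io.StringIO()
write_report(
    [{"filepath": "C:/a;b.pdf", "encrypted": False, "pages": 3, "embedded": 1,
      "attachments": ["x.txt"], "annotations": True, "portfolio": False}],
    buffer,
)
print(buffer.getvalue())
```

When writing to a real file, open it with newline="" as the csv documentation recommends, for example open(path, "w", newline="", encoding="utf-8").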
