Getting PDF attachments with Python

2 answers

User

#1 · edited 2024-09-28 01:24:58

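This is too long for a comment. I have not tested this code myself, and it looks very similar to what you outlined in your question, but I am adding it here so that others can test it. It is the subject of pull request https://github.com/mstamy2/PyPDF2/pull/440, and the full updated sequence is described by Kevin M Loeffler at https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/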

It can be viewed at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py

It can be downloaded from https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py

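It always helps if you can provide a sample input of the kind that gives you trouble, so that others can adapt the extraction routine to suit your case.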

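In response to the error you received: "My guess is that the script is crashing because the embedded files section of the PDF does not always exist, so trying to access it raises an error." "I would try putting everything after the 'catalog' line in the get_attachments method inside a try-catch."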
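
The try-catch suggestion quoted above might look roughly like the following. This is only a sketch, not the code from the gist or the pull request: the names `list_embedded_files` and `_resolve` are made up for illustration, and the function works on any dict-like catalog (for example `reader.trailer["/Root"]` in PyPDF2) so it can be tried without a PDF library. It assumes the attachment names sit directly in a flat `/Names` array; a PDF whose name tree is split into `/Kids` would come back empty here.

```python
def _resolve(obj):
    """Follow an indirect reference if the object offers one (PyPDF2-style)."""
    for attr in ("get_object", "getObject"):
        getter = getattr(obj, attr, None)
        if callable(getter):
            return getter()
    return obj


def list_embedded_files(catalog):
    """Return {name: file specification}, or {} if the PDF has no attachments."""
    try:
        names_dict = _resolve(catalog["/Names"])
        embedded = _resolve(names_dict["/EmbeddedFiles"])
        names = _resolve(embedded["/Names"])
    except (KeyError, TypeError):
        # The embedded files section is optional, so a missing key is not an error.
        return {}

    attachments = {}
    # The leaf array is flat: [name1, filespec1, name2, filespec2, ...]
    for i in range(0, len(names) - 1, 2):
        attachments[str(names[i])] = _resolve(names[i + 1])
    return attachments
```

With a real reader, the file contents would then have to be read from each file specification's `/EF` entry; how to do that depends on the library version, so it is left out here.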

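Unfortunately, there are many pending pull requests that have not been merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls), and some of the others may also be related to, or needed for, fixing this and other defects. So you will need to check whether any of those help as well.

For a pending example of a try-catch that you may be able to include and adapt to other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3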

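Besides /Type/EmbeddedFiles, relevant keywords for embedded files include /Type /Filespec and /Subtype /FileAttachment. Note that these pairs are not always written with a space between them, so it is worth checking whether they can be used to query for attachments.

Also on that last point, the example searches for /EmbeddedFiles, the plural form used for the index, whereas any single entry is itself identified by the singular form.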
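
One rough way to check for the keywords mentioned above is to scan the raw bytes of the file with patterns that tolerate optional whitespace. This is only a sketch under assumptions: the helper name `count_attachment_markers` and the labels are made up for illustration, and a byte scan cannot see objects stored inside compressed object streams, so a count of zero does not prove there are no attachments.

```python
import re

# Whitespace between the key and the name is optional in PDF syntax,
# so each pattern allows zero or more whitespace characters.
MARKERS = {
    "filespec": re.compile(rb"/Type\s*/Filespec\b"),
    "file_attachment_annotation": re.compile(rb"/Subtype\s*/FileAttachment\b"),
    "embedded_files_index": re.compile(rb"/EmbeddedFiles\b"),  # plural: the index
    "embedded_file_entry": re.compile(rb"/Type\s*/EmbeddedFile\b"),  # singular: one entry
}


def count_attachment_markers(pdf_bytes):
    """Count how often each attachment-related keyword appears in the raw bytes."""
    return {label: len(pattern.findall(pdf_bytes)) for label, pattern in MARKERS.items()}
```

To try it on a file, read it in binary mode, for example `count_attachment_markers(open(path, "rb").read())`, where `path` is whatever PDF you are investigating.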

User
#2 · edited 2024-09-28 01:24:58

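This could be improved, but it has been tested (with PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I have not yet compared the output against our internal classification.
It produces a semicolon-separated file that can be imported into Excel.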
    import os

    import fitz  # = PyMuPDF

    folder = "C:/Users/me/Downloads"
    outpath = "C:/Users/me/Downloads/testPDF3.txt"

    with open(outpath, "w", encoding="utf-8") as outfile:
        print("filepath;encrypted;pages;embedded;attachments;annotations;portfolio", file=outfile)
        for subdir, dirs, files in os.walk(folder):
            for file in files:
                filepath = os.path.join(subdir, file)
                if not filepath.endswith(".pdf"):
                    continue
                try:
                    doc = fitz.open(filepath)
                    enc = doc.is_encrypted
                    pages = doc.page_count
                    count = doc.embfile_count()  # number of embedded files
                    names = doc.embfile_names()  # names of the embedded files
                    annots = doc.has_annots()
                    # A portfolio shows up as a "Collection" entry in some object.
                    # Reset the flag for every file so an earlier result cannot leak through.
                    collection = False
                    for xref in range(1, doc.xref_length()):  # skip item 0!
                        if "Collection" in doc.xref_object(xref, compressed=False):
                            collection = True
                            break
                    doc.close()
                except Exception:
                    enc = pages = count = names = annots = collection = "Not a valid PDF"
                print(filepath, enc, pages, count, names, annots, collection, sep=";", file=outfile)
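
The script above only reports which files carry attachments. To actually save them, something like the following might work. It is a sketch rather than tested production code: `save_attachments` is a made-up helper name, and it relies only on the `embfile_names()` and `embfile_get()` methods of an open PyMuPDF document.

```python
import os


def save_attachments(doc, out_dir):
    """Write every embedded file of an open PyMuPDF document into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for name in doc.embfile_names():
        # Keep only the base name so an entry cannot be written outside out_dir.
        target = os.path.join(out_dir, os.path.basename(name) or "unnamed")
        with open(target, "wb") as fh:
            fh.write(doc.embfile_get(name))  # embfile_get returns the file content as bytes
        saved.append(target)
    return saved


# Usage, assuming PyMuPDF is installed and "input.pdf" (a placeholder name) exists:
#   import fitz
#   doc = fitz.open("input.pdf")
#   print(save_attachments(doc, "attachments"))
#   doc.close()
```

Two entries that share the same base name would overwrite each other here; adding a counter to the file name is one way around that.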
