PDF file to dict returns strange characters

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
import collections

filename = "Edited_CS.pdf"
fp = open(filename, 'rb')
my_dict = {}

parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']

# Checks if PDF file is blank
if isinstance(fields, collections.abc.Sequence) is False:
    print("This Character Sheet is blank. Please submit a filled Character Sheet!")
else:
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        if value is None or str(value)[2:-1] == "":
            value = "b'None'"
        my_dict[str(name)[2:-1]] = str(value)[2:-1]

for g in list(my_dict.items()):
    print(g)

('ClassLevel', '\\xfe\\xff\\x00C\\x00l\\x00a\\x00s\\x00s\\x00L\\x00e\\x00v\\x00e\\x00l')
('Background', '\\xfe\\xff\\x00B\\x00a\\x00c\\x00k\\x00g\\x00r\\x00o\\x00u\\x00n\\x00d\\x00r')
('PlayerName', '\\xfe\\xff\\x00P\\x00l\\x00a\\x00y\\x00e\\x00r\\x00N\\x00a\\x00m\\x00e')
('CharacterName', '\\xfe\\xff\\x00T\\x00h\\x00o\\x00m\\x00a\\x00s')
('Race ', '\\xfe\\xff\\x00R\\x00a\\x00c\\x00e')
('Alignment', '\\xfe\\xff\\x00A\\x00l\\x00i\\x00g\\x00n\\x00m\\x00e\\x00n\\x00t')
etc...

Traceback (most recent call last):
  File "main.py", line 16, in <module>
    fp = open(filename, 'rb').read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

2 Answers

User

#1 · Edited 2024-06-28 20:28:10

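You can work around this by editing the form fields with Adobe Acrobat Reader DC. I edited the form fields of Edited_CS.pdf with it, and pdfminer.six then returned the expected output.

Microsoft Edge may be what caused the problem.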
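For context on the "strange characters": every value in the question's output starts with \xfe\xff, which is the UTF-16 big-endian byte order mark that PDF uses for Unicode text strings. Here is a minimal sketch of decoding such a value directly, assuming the field value arrives as a raw bytes object (as the b'...' prefix stripped in the question suggests). The helper name decode_pdf_text is hypothetical, and the Latin-1 fallback is only an approximation of PDFDocEncoding, not an exact mapping.

```python
UTF16_BE_BOM = b"\xfe\xff"


def decode_pdf_text(raw: bytes) -> str:
    """Decode a PDF text string given as raw bytes (hypothetical helper)."""
    if raw.startswith(UTF16_BE_BOM):
        # Unicode text string: drop the byte order mark, decode the rest as UTF-16BE.
        return raw[len(UTF16_BE_BOM):].decode("utf-16-be")
    # Assumption: treat everything else as Latin-1, which only approximates
    # PDFDocEncoding but matches it for plain ASCII field values.
    return raw.decode("latin-1")


# Sample bytes taken from the 'CharacterName' entry in the question's output.
sample = b"\xfe\xff\x00T\x00h\x00o\x00m\x00a\x00s"
print(decode_pdf_text(sample))  # prints "Thomas"
```

Applied to the question's loop, this would mean decoding the raw name and value bytes instead of slicing str(value)[2:-1].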

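User

#2 · Edited 2024-06-28 20:28:10

After some digging, I found a better solution. Instead of opening the PDF with pdfminer, I used PyPDF2. It read this PDF without any encoding trouble, and it has a method that turns the fillable text fields straight into a proper dictionary. The result is shorter, cleaner code: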

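# Note: PyPDF2 3.x renames these to PdfReader and get_form_text_fields()
from PyPDF2 import PdfFileReader

infile = "Edited_CS.pdf"

# Keep the file open while the reader pulls the form fields from it
with open(infile, "rb") as fp:
    pdf_reader = PdfFileReader(fp)
    # Maps each text field's name to its value
    dictionary = pdf_reader.getFormTextFields()

for g in dictionary.items():
    print(g)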

Anyway, thanks for your answer! :)
