PDF file to dict returns strange characters



I am trying to write a program that uses pdfminer to read a DnD character sheet (a fillable PDF) and put the filled-in fields into a dictionary. After editing the PDF and running the program again, I get a strange sequence of characters when printing the dictionary items. The code:

<pre class="lang-py prettyprint-override"><code>from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
import collections.abc

filename = "Edited_CS.pdf"
fp = open(filename, 'rb')
my_dict = {}

parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']

# Checks if PDF file is blank
if isinstance(fields, collections.abc.Sequence) is False:
    print("This Character Sheet is blank. Please submit a filled Character Sheet!")
else:
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        if value is None or str(value)[2:-1] == "":
            value = "b'None'"
        my_dict[str(name)[2:-1]] = str(value)[2:-1]

for g in list(my_dict.items()):
    print(g)
</code></pre>

Output for the unedited PDF file:

<pre class="lang-py prettyprint-override"><code>('ClassLevel', 'Assassin 1')
('Background', 'Lone Survivor')
('PlayerName', 'None')
('CharacterName', 'Tumas Mitshil')
('Race ', 'Human')
etc...
</code></pre>

Output for the edited file (I completely changed the class level and so on in the PDF):

<pre class="lang-py prettyprint-override"><code>('ClassLevel', '\\xfe\\xff\\x00C\\x00l\\x00a\\x00s\\x00s\\x00L\\x00e\\x00v\\x00e\\x00l')
('Background', '\\xfe\\xff\\x00B\\x00a\\x00c\\x00k\\x00g\\x00r\\x00o\\x00u\\x00n\\x00d\\x00r')
('PlayerName', '\\xfe\\xff\\x00P\\x00l\\x00a\\x00y\\x00e\\x00r\\x00N\\x00a\\x00m\\x00e')
('CharacterName', '\\xfe\\xff\\x00T\\x00h\\x00o\\x00m\\x00a\\x00s')
('Race ', '\\xfe\\xff\\x00R\\x00a\\x00c\\x00e')
('Alignment', '\\xfe\\xff\\x00A\\x00l\\x00i\\x00g\\x00n\\x00m\\x00e\\x00n\\x00t')
etc...
</code></pre>

I know this is some kind of encoding, and some googling led me to believe it was UTF-8, so I tried decoding the PDF when opening the file:

<pre class="lang-py prettyprint-override"><code>fp = open(filename, 'rb').read().decode('utf-8')
</code></pre>

Unfortunately, I got an error:

<pre class="lang-py prettyprint-override"><code>Traceback (most recent call last):
  File "main.py", line 16, in <module>
    fp = open(filename, 'rb').read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
</code></pre>

I originally created the PDF with Adobe Acrobat. However, I edited the file with Microsoft Edge, which led to the problem I am facing. Here are the files: <a href="https://drive.google.com/file/d/11ZmvDrkWOke-YuFhfhWJUmi-Nf8bv3wZ/view?usp=sharing" rel="nofollow noreferrer">Original File</a> <a href="https://drive.google.com/file/d/1Leil3lFvDwEMpZa9zNm9URhDtflgcuMH/view?usp=sharing" rel="nofollow noreferrer">Edited File</a>

Is there a way to decode this correctly? Is there a way to encode the edited PDF so that it loads easily into Python? If this is an encoding, are there other forms of encoding, and how would I decode them?

Any help would be greatly appreciated.
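A note on the bytes shown above: the leading `\xfe\xff` is the UTF-16 big-endian byte order mark, and the `\x00` before each letter is consistent with UTF-16BE text. PDF text strings may be stored either in a single-byte encoding or as UTF-16BE with that marker, so the decoding has to happen per field value, not on the whole file (the file itself is a binary container, which is why `.decode('utf-8')` on its raw bytes fails). The following is only a minimal sketch: `decode_pdf_string` is a hypothetical helper name, it assumes the field value is a `bytes` object, and it uses Latin-1 as a stand-in for the PDF single-byte encoding, which is an approximation that holds for plain ASCII values like the ones in the question.

```python
import codecs


def decode_pdf_string(raw):
    """Decode one PDF text string given as bytes (hypothetical helper).

    Returns None unchanged so that empty fields can be handled by the caller.
    """
    if raw is None:
        return None
    # UTF-16BE strings in a PDF start with the byte order mark b"\xfe\xff".
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates the single-byte case for ASCII text.
    return raw.decode("latin-1")


# Sample values copied from the question's output.
edited = b"\xfe\xff\x00T\x00h\x00o\x00m\x00a\x00s"
original = b"Tumas Mitshil"

print(decode_pdf_string(edited))    # Thomas
print(decode_pdf_string(original))  # Tumas Mitshil
```

In the loop from the question, this would take the place of the `str(value)[2:-1]` slicing, which only strips the `b'...'` wrapper from the printed form and never decodes anything. Field values are not always `bytes` (checkboxes, for example, may come back as other object types), so an `isinstance(value, bytes)` check before decoding is probably wise. pdfminer.six also appears to provide a similar helper, `pdfminer.utils.decode_text`; it is worth checking whether your installed version has it before relying on it.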
