I am trying to search a PDF file for its author, and I am not allowed to use any third-party PDF module.

import re
f=bytes("k://file.pdf",'ascii')
open("k://file.pdf")
for line in f:
    if re.match("(.*)(Author)(.*)", line):
        print (line),

The error message I get is:

Traceback (most recent call last):
  File "K:\hw3pdftest.py", line 8, in <module>
    if re.match("(.*)(Author)(.*)", line):
  File "C:\Python34\lib\re.py", line 160, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or buffer

Traceback (most recent call last):
  File "K:\hw3pdftest.py", line 6, in <module>
    for line in f:
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 515: character maps to <undefined>

2 Answers

User

#1 · edited 2024-10-03 11:16:01

This is incorrect:

f = bytes("k://file.pdf",'ascii')
for line in f:
    ...

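You are not iterating over the lines of the PDF. You are iterating over the byte values in b'k://file.pdf', that is, the ASCII codes of the characters k, :, / and so on, which are integers. Open the file itself instead. Because a PDF is binary data, open it in binary mode ("rb"); text mode is what triggers the UnicodeDecodeError above. In binary mode each line is a bytes object, so the regex pattern must be bytes as well (for example rb"Author"):

f = open("k://file.pdf", "rb")
for line in f:
    ...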
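As a minimal sketch of that loop (not the asker's final solution), the file can be read in binary mode and each line tested with a bytes pattern. The function name `lines_mentioning_author` is made up for illustration, and the sketch assumes the word "Author" appears uncompressed somewhere in the file.

```python
import re

# Bytes pattern: the file is read in binary mode, so the pattern must be bytes too.
AUTHOR_LINE = re.compile(rb"Author")


def lines_mentioning_author(path):
    """Return every raw line of the file at `path` that contains b'Author'."""
    matches = []
    with open(path, "rb") as f:        # binary mode: no text decoding, no UnicodeDecodeError
        for line in f:                 # each `line` is a bytes object ending at b"\n"
            if AUTHOR_LINE.search(line):
                matches.append(line)
    return matches
```

Calling `lines_mentioning_author("k://file.pdf")` would return the matching lines as bytes. `search` is used rather than `match` so the surrounding `(.*)` groups from the question are not needed.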

User

#2 · edited 2024-10-03 11:16:01

According to the official PDF Specification, a PDF stores the author name under the following key in a PDF dictionary:

/Author (John Doe)

So you should try running the following regular expression against the PDF file:

\/Author\s*\((.+?)\)

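It returns the author name as match group #1. Note that in some cases you may need to decode that string further: if it contains Unicode characters, it may be stored in a special encoding.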
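A hedged sketch of this approach using only the standard library follows. The names `find_author` and `decode_pdf_text` are made up for illustration. It assumes the author is stored as an uncompressed, unencrypted literal string such as /Author (John Doe). Hex strings, escaped parentheses inside the name, and metadata kept in compressed object streams are not handled. Decoding non-Unicode strings as Latin-1 is only an approximation of PDFDocEncoding.

```python
import re

# Capture the bytes between the parentheses that follow /Author.
# Non-greedy, so the match stops at the first closing parenthesis.
AUTHOR_RE = re.compile(rb"/Author\s*\((.+?)\)", re.DOTALL)


def decode_pdf_text(raw):
    """Decode a PDF text string: UTF-16BE if it starts with a BOM, else Latin-1."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be", errors="replace")
    return raw.decode("latin-1")       # assumption: stand-in for PDFDocEncoding


def find_author(data):
    """Return the first /Author value found in `data` (bytes), or None."""
    m = AUTHOR_RE.search(data)
    return decode_pdf_text(m.group(1)) if m else None
```

To use it on a file, read the whole file in binary mode and pass the bytes in, for example `find_author(open("k://file.pdf", "rb").read())`. Searching the whole buffer rather than line by line avoids missing a dictionary that spans several lines.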
