Python: Converting an RTF file to Unicode?

import re

usefulLines = []
textData = {}

# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient
# for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})[\s].*$'

f = open('textbase_1a.rtf', 'Ur')
fileLines = f.readlines()

# get the matching line numbers, and store in usefulLines
for i, line in enumerate(fileLines):
    # line = line.decode('utf-16be')  # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)

1 answer

User

#1 · Posted on 2024-10-01 09:26:41

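You are not actually decoding the RTF at all. RTF files are plain text files with markup. For example, a file containing "äöü" has the following content: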

{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par
}

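That is what the file looks like when opened in a text editor. So the characters "äöü" are encoded as windows-1252, as declared at the start of the file by \ansicpg1252 (äöü = 0xE4 0xF6 0xFC).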
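To make that concrete, here is a small sketch (written for Python 3, unlike the Python 2 code in the question) that pulls the \'xx escapes out of the body of the sample and decodes the resulting bytes with the code page named in the header. The variable names are mine, and the fragment is copied by hand from the sample rather than read from a real file.

```python
import re

# Body text copied from the sample RTF: three \'xx hex escapes.
rtf_fragment = r"\'e4\'f6\'fc"

# Collect the two hex digits after each \' escape and turn them into raw bytes.
hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", rtf_fragment)
raw = bytes(int(h, 16) for h in hex_pairs)

# \ansicpg1252 in the header names the code page, i.e. Python's "cp1252" codec.
text = raw.decode("cp1252")
print(text)  # äöü
```

Decoding the whole file as mac_roman or utf-16be cannot produce this result, because the non-ASCII characters are not stored as raw bytes in the file; they are spelled out as ASCII escape sequences.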

To read the RTF, you first need to convert the RTF to text (this has already been asked here).
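As a rough illustration of what such a conversion involves, the following is a minimal sketch in Python 3, not the method from the linked question and not a complete RTF parser. The function name rtf_to_text, the helper codec_for, and the list of skipped destinations are my own choices. It assumes the code page from \ansicpg maps to a Python codec called "cp<N>" (falling back to cp1252 otherwise), and it ignores \uN Unicode escapes, tables, and most other features.

```python
import codecs
import re

# One token per match: a \'xx hex escape, a control word (optional numeric
# parameter, optional single trailing space), a control symbol, a brace,
# or a run of plain text.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"        # 1: hex escape
    r"|\\([a-zA-Z]+)(-?\d+)? ?"   # 2, 3: control word and its parameter
    r"|\\([^a-zA-Z])"             # 4: control symbol such as \* or \{
    r"|([{}])"                    # 5: group delimiter
    r"|([^\\{}]+)"                # 6: plain text
)

# Groups whose contents are metadata rather than document text (assumed list).
SKIPPED_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info"}


def codec_for(codepage):
    """Map an \\ansicpg number to a Python codec name, defaulting to cp1252."""
    name = "cp" + codepage
    try:
        codecs.lookup(name)
    except LookupError:
        return "cp1252"
    return name


def rtf_to_text(rtf):
    codec = "cp1252"
    out = []
    pending = bytearray()  # bytes from consecutive \'xx escapes
    stack = []             # saved "skipping" flag for each open group
    skipping = False

    for match in TOKEN.finditer(rtf):
        hex_byte, word, param, symbol, brace, text = match.groups()
        if hex_byte is not None:
            if not skipping:
                pending.append(int(hex_byte, 16))
            continue
        # Any other token ends a run of escaped bytes, so decode it now.
        if pending:
            out.append(pending.decode(codec, errors="replace"))
            pending.clear()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word is not None:
            if word == "ansicpg" and param:
                codec = codec_for(param)
            elif word in SKIPPED_DESTINATIONS:
                skipping = True
            elif word in ("par", "line") and not skipping:
                out.append("\n")
        elif symbol is not None:
            if symbol == "*":
                skipping = True  # \* marks an ignorable destination
            elif symbol in "\\{}" and not skipping:
                out.append(symbol)
        elif text is not None and not skipping:
            # Raw line breaks in the file are not part of the document text.
            out.append(text.replace("\r", "").replace("\n", ""))
    if pending:
        out.append(pending.decode(codec, errors="replace"))
    return "".join(out)


sample = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}" "\n"
    r"{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par" "\n"
    r"}"
)
print(repr(rtf_to_text(sample)))
```

Once the file has been reduced to a plain Unicode string this way, the line-by-line regex matching from the question can be run on the result of splitting that string into lines, with no further decode step. For real documents a dedicated RTF library is likely a safer choice than a hand-rolled tokenizer like this one.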
