UnicodeDecodeError when modifying a text file

import string

infile = open("unigram.wfreq","r")
outfile = open("bigram.txt","w")

line = "Start"
while line != "":
    line = infile.readline()
    wordandcount = line.split()
    word = wordandcount[0]
    ##Fix å ä ö.
    ## å == √• ä == √§ ö == √∂
    if "√•" in word or "√§" in word or "√∂" in word:
        word = word.replace("√•","å")
        word = word.replace("√§","ä")
        word = word.replace("√∂","ö")
    if word.isalpha():
        word = word.lower()
        outfile.write(word+"\n")
    print(line)

Traceback (most recent call last):
  File "formater.py", line 13, in <module>
    line = infile.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 2732-2733: invalid continuation byte

3 Answers

User

#1 · Edited 2024-09-28 01:28:17

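If f√∂gl√∂mma appears in your sample file where it should read föglömma, and the Python script complains that the file is not valid UTF-8, then a wrong encoding has been baked into your unigram.wfreq file.

At some point the UTF-8 data was interpreted as Mac Roman and then saved as Mac Roman.

By saving the file as UTF-8 again, you have baked the earlier error in even further.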
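This kind of damage can often be undone by reversing the wrong step: encode the garbled text back to bytes with mac_roman, then decode those bytes as UTF-8. The following is only a minimal sketch, assuming the string went through exactly one round of "UTF-8 bytes read as Mac Roman". The helper name `undo_mac_roman_mojibake` is made up for illustration.

```python
def undo_mac_roman_mojibake(garbled):
    """Reverse one round of 'UTF-8 bytes read as Mac Roman'.

    Assumption: the garbling happened exactly once and affected the
    whole string. Strings that do not fit that pattern are returned
    unchanged.
    """
    try:
        # Recover the original bytes, then decode them the right way.
        return garbled.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character has no Mac Roman byte, or the recovered bytes
        # are not valid UTF-8, so the text was probably not damaged this way.
        return garbled


print(undo_mac_roman_mojibake("f√∂gl√∂mma"))  # föglömma
```

This does the same job as the chain of `word.replace(...)` calls in the question, but covers every character rather than only å, ä and ö. It only helps once the file can be read at all, so it does not by itself remove the UnicodeDecodeError.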

User

#2 · Edited 2024-09-28 01:28:17

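So I found a simple solution to this problem. I opened my wfreq file in Sublime Text 2, where I could save it with UTF-8 encoding. That fixed the problem with the Swedish letters. I also changed the extension to .txt. After that I ran the Python code again (with the file name changed and the å/ä/ö replacement part removed), and it worked fine.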
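The same re-save can be scripted rather than done in an editor. This is a rough sketch, assuming you know (or can work out) which encoding the file is really in. The function name `resave_as_utf8` and its parameters are placeholders invented here, not something from the question.

```python
import io


def resave_as_utf8(src_path, dst_path, source_encoding):
    """Read src_path using source_encoding and write the text out as UTF-8.

    Assumption: source_encoding is the encoding the file is actually
    stored in; a wrong guess either raises UnicodeDecodeError or
    silently produces garbled text.
    """
    with io.open(src_path, "r", encoding=source_encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

A call would look like `resave_as_utf8("unigram.wfreq", "unigram.txt", some_encoding)`, where `some_encoding` is whatever the file really uses. Writing to a new file instead of overwriting the original keeps a wrong guess from destroying the data.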

User

#3 · Edited 2024-09-28 01:28:17

The file looks like it is encoded in UTF-8, but you are displaying it with the mac_roman encoding. Here is a test:

# coding: utf8
data = u'mammutslätten föglömma'
# Encode as UTF-8, then (wrongly) decode the bytes as Mac Roman
print(data.encode('utf8').decode('mac_roman'))

Output:

mammutsl√§tten f√∂gl√∂mma

To read the file correctly in Python, open it with the right encoding so that each line comes back as a Unicode string:

import io
# io.open accepts encoding= on both Python 2 and Python 3
with io.open('unigram.wfreq',encoding='utf8') as f:
    for line in f:
        print(line.strip())

Output:

gruppselektion 4
lating 1
Morsing 2
varuhusen 7
FULLT 8
latino 3
mammutslätten 2
föglömma 1
varuhuset 47
livsnjutningen 1
nedtoning 1
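
Combined with the script from the question, a Python 3 version might look like the sketch below. It assumes unigram.wfreq really is UTF-8 with one "word count" pair per line and that only purely alphabetic words should be written out. It also skips blank lines, which would otherwise raise an IndexError on `wordandcount[0]` at the end of the file. The function name `extract_words` is invented for this example.

```python
def extract_words(in_path, out_path):
    """Write the lowercased alphabetic words of a 'word count' file, one per line."""
    # Explicit encodings on both files, so the platform default is never used.
    with open(in_path, "r", encoding="utf-8") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            fields = line.split()
            if not fields:
                continue  # blank line: nothing to extract
            word = fields[0]
            if word.isalpha():
                outfile.write(word.lower() + "\n")
```

Calling `extract_words("unigram.wfreq", "bigram.txt")` reproduces the file names used in the question. Iterating over the file object replaces the `while line != ""` loop, so no sentinel value is needed.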
