Wrong UTF-8 characters when writing to a file (Python)

    def sortWords(subject, articles, stopWordsFile):
        stopWords = []
        f = open(stopWordsFile)
        for lines in f:
            stopWords.append(lines.split(None, 1)[0].lower())
        for x in range(0,len(articles)):
            f = open(articles[x], 'r')
            article = f.read().lower()
            article = re.sub("[^a-zA-Z\æøåÆØÅöÖüÜ\ ]+", " ", article)
            article = [word for word in article.split() if word not in stopWords]
            print ' '.join(article)
            w = codecs.open(subject+str(x)+'.txt', 'w+')
            w.write(' '.join(article))

    sortWords("hpv", ["vaccine_texts/hpv1.txt"], "stopwords.txt")

2 Answers

User

#1 · Edited 2024-10-01 00:36:00

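When you see something like ÃН in a text file (or more generally, two characters of which the first is Ã), the file was most likely written correctly in UTF-8, and it is the editor (or the screen) that is not handling UTF-8 correctly.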

Take æ. It is the Unicode character U+00E6. Encoding it as UTF-8 gives the two bytes b'\xc3\xa6', and when those bytes are decoded as Latin-1 they display as 'Ã¦'.
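The round trip described above can be reproduced in a few lines. This is a minimal sketch; the variable names are illustrative only and do not come from the question.

```python
# Encode U+00E6 as UTF-8, then (wrongly) decode the bytes as Latin-1.
char = u"\u00e6"                          # the letter ae (U+00E6)
utf8_bytes = char.encode("utf-8")         # two bytes: 0xC3 0xA6
mojibake = utf8_bytes.decode("latin-1")   # two characters: U+00C3, U+00A6

print(len(utf8_bytes), len(mojibake))

# Undoing the wrong decoding recovers the original character,
# which shows that the bytes themselves were never damaged.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == char)
```

The bytes on disk are the same in both readings; only the interpretation differs.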

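How can you confirm this? Use the excellent vim editor, which understands many encodings, UTF-8 among them, at least when you use its graphical interface gvim.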
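If no such editor is at hand, the same check can be done from Python by looking at the raw bytes instead of the rendered text. The helper below is a hypothetical sketch (the name `is_valid_utf8` is an assumption, not part of the question's code). Note that a pure-ASCII file also passes, since ASCII is a subset of UTF-8.

```python
import io

def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with io.open(path, "rb") as f:   # binary mode: no decoding happens here
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

If this returns True for the output file while the editor still shows Ã followed by another character, the problem is most likely on the display side rather than in the written file.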

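One more piece of general advice: never write non-ASCII characters in a Python source file unless you put a # -*- coding: ... -*- line as the first line (or as the second line, if the first is a hashbang line). This matters in Python 2, where the default source encoding is ASCII.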
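A minimal sketch of what such a file header looks like, assuming the source file itself is saved as UTF-8. The variable `letters` is only there for illustration.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The declaration above must match the encoding the file is actually saved in.

# With the declaration in place, non-ASCII literals are decoded correctly.
letters = u"æøå"
print(len(letters))
```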

If you want to use Unicode with Python under Windows, be sure to use IDLE, which handles it natively.

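TL;DR: if you are on Linux, your system is most likely configured to use UTF-8 natively, and you are writing your text file correctly in UTF-8, but your text editor fails to display UTF-8 correctly.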

User

#2 · Edited 2024-10-01 00:36:00

Have you tried:

    w.write(' '.join(article).encode('utf8'))

Don't forget to close the file (better still, use a with context manager when working with files).
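A minimal sketch of the context-manager approach, here using `io.open` with an explicit encoding so that the text is encoded on write. This swaps in `io.open` for the `codecs.open` call in the question, and the function names `write_words` and `read_text` are hypothetical. It assumes the words are already Unicode strings.

```python
import io

def write_words(path, words):
    """Write the words space-separated as UTF-8; the file closes on exit."""
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(u" ".join(words))

def read_text(path):
    """Read the file back, decoding it as UTF-8."""
    with io.open(path, "r", encoding="utf-8") as src:
        return src.read()
```

Because the file is closed when the with block exits, buffered data is flushed to disk even if an exception is raised part-way through.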
