Decoding Unicode in a custom NLTK corpus

import os

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader

mr = CategorizedPlaintextCorpusReader(
    r'C:\mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*'))

for w in mr.words():
    print(w)

1 Answer

User

#1 · Posted 2024-09-27 07:26:42

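The problem is that NLTK's corpus readers assume by default that your plain-text files are saved as UTF-8. That assumption is evidently wrong here, because the files were written with a different codec. My guess is CP1252 (also known as "Windows Latin-1"), since it is very common and fits your description well: in that encoding the en dash "–" is stored as the byte 0x96, which is the byte named in the error message.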
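To see why that byte trips up a UTF-8 decoder, here is a minimal sketch using only the standard library. The sample string is made up for illustration and is not taken from the original corpus; it just contains the two characters discussed in this thread.

```python
# Hypothetical sample text: an en dash (U+2013) and a bullet (U+2022).
sample_text = "neg \u2013 pos \u2022"

# In CP1252 the en dash becomes byte 0x96 and the bullet becomes byte 0x95.
raw = sample_text.encode("cp1252")

# Decoding those bytes as UTF-8 fails, because 0x96 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Decoding with the codec that was actually used restores the text.
round_trip = raw.decode("cp1252")

print(raw)
print(utf8_error)
print(round_trip == sample_text)
```

The byte offset reported in the exception points at the first offending byte, which can help locate the character in the file.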

You can specify the encoding of the input files in the corpus reader's constructor:

mr = CategorizedPlaintextCorpusReader(
    r'C:\mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*',),
    encoding='cp1252')

Try this, then check that the non-ASCII characters in the output (dashes, bullets) come out correctly and have not been replaced by mojibake.
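If you are not sure which codec a file uses, one rough way to check is to try the candidates on the raw bytes and then scan the result for characters that usually signal a wrong decode. This is only a sketch: `decode_with_fallback` and `suspicious_chars` are hypothetical helper names, not part of NLTK, and the candidate list assumes the file is either UTF-8 or CP1252.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


def suspicious_chars(text):
    """Collect characters that typically indicate mojibake.

    U+FFFD is the replacement character; U+0080..U+009F are C1 control
    characters, which show up when CP1252 bytes are read as Latin-1.
    """
    return sorted({ch for ch in text if ch == "\ufffd" or "\x80" <= ch <= "\x9f"})


# Made-up bytes standing in for the contents of one corpus file.
sample = "good \u2013 bad \u2022 ugly".encode("cp1252")

text, used = decode_with_fallback(sample)
wrong = sample.decode("latin-1")  # decodes without error, but to the wrong characters

print(used, suspicious_chars(text))
print(suspicious_chars(wrong))
```

UTF-8 is tried first because CP1252 text containing non-ASCII characters is very unlikely to be valid UTF-8 by accident, whereas single-byte codecs accept almost any input. A clean decode is therefore only a hint, so it is still worth eyeballing the dashes and bullets in the output.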
