How can I make Python 2.6 functions work with Unicode?

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>>

>>> jottedf = u'/n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)

1 answer

User

#1 · Posted 2024-09-29 21:42:29

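For each string you read from the file, if the text is UTF-8 encoded, you can convert it to unicode by calling rawness.decode('utf-8'). You will get unicode objects back. Also, I don't know what "jotted" is, but you probably want to make sure it holds unicode objects and use u'\n'.join(jotted).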
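A minimal sketch of the decode step described above. The helper name read_unicode and the sample words are hypothetical, chosen only for illustration; the sketch assumes the file really is UTF-8 and opens it in binary mode so the same code behaves identically on Python 2.6+ and Python 3.

```python
# -*- coding: utf-8 -*-
import os
import tempfile


def read_unicode(path, encoding='utf-8'):
    """Read raw bytes from path and decode them into a unicode string.

    The helper name and the default encoding are assumptions for this sketch.
    """
    with open(path, 'rb') as handle:
        raw = handle.read()
    return raw.decode(encoding)


# Build a small UTF-8 sample file containing non-ASCII characters.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as handle:
    handle.write(u'l\xe1ny\nk\xf6nyv'.encode('utf-8'))

text = read_unicode(demo_path)   # unicode object
words = text.split()             # list of unicode objects
joined = u'\n'.join(words)       # joining with a unicode separator stays unicode
os.remove(demo_path)
```

Keeping everything as unicode inside the program and converting only at the boundaries (reading and writing) avoids the implicit ASCII conversions that raise UnicodeEncodeError.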

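Update:

It seems the NLTK library doesn't like unicode objects. In that case, you have to make sure the str instances you pass to it contain UTF-8-encoded text. Try this: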

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])
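To illustrate why the traceback points at map(str, tokens[:8]): on Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec. The sketch below makes that encoding explicit so it behaves the same on Python 2.6+ and Python 3; the sample token is hypothetical, and no NLTK call is involved.

```python
# -*- coding: utf-8 -*-

token = u'l\xe1ny'  # hypothetical token containing a non-ASCII character

# Encoding with the ASCII codec fails for any character above 127.
# This is what Python 2's str(token) does implicitly.
try:
    token.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and yields a byte string,
# which is what the list comprehension above hands to nltk.Text.
encoded = token.encode('utf-8')
```

The encoded tokens are byte strings, so they need to be decoded again with .decode('utf-8') before any character-level processing.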

And this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

However, if jotted really is a list of UTF-8-encoded str, you don't need this, and the following is enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
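As an alternative to encoding by hand before every write, the file can be opened with an explicit encoding so that unicode strings are encoded on the way out. This is a swapped-in technique, not something from the question: it uses io.open, which is available from Python 2.6 onward, and the helper name write_vocab is hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_vocab(path, words, encoding='utf-8'):
    """Join unicode words with newlines and write them using encoding.

    The helper name is an assumption for this sketch; io.open performs the
    encoding, so the caller passes unicode objects, not byte strings.
    """
    joined = u'\n'.join(words)
    with io.open(path, 'w', encoding=encoding) as handle:
        handle.write(joined)


fd, out_path = tempfile.mkstemp()
os.close(fd)
write_vocab(out_path, [u'l\xe1ny', u'k\xf6nyv'])

# Read the file back through the same codec to confirm the round trip.
with io.open(out_path, 'r', encoding='utf-8') as handle:
    round_trip = handle.read()
os.remove(out_path)
```

With this approach the words must be unicode objects; passing UTF-8-encoded str values to a file opened this way raises TypeError on Python 2.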

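By the way, NLTK doesn't seem to be very careful about unicode and encodings (at least in its demos). Better be careful and check that it has processed your tokens correctly. Also, and this may be why you get errors with the Hungarian text but not with the German text, check your encodings.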
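One rough way to check an encoding is to test whether the raw bytes decode cleanly. The sketch below assumes a hypothetical helper named decodes_as and uses ISO-8859-2 as an example of a non-UTF-8 encoding; which encoding the actual book files use is not stated in the question.

```python
# -*- coding: utf-8 -*-


def decodes_as(raw, encoding):
    """Return True if the byte string raw decodes without error as encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The same word stored in two different encodings (sample data only).
utf8_bytes = u'k\xf6nyv'.encode('utf-8')
latin2_bytes = u'k\xf6nyv'.encode('iso-8859-2')

utf8_ok = decodes_as(utf8_bytes, 'utf-8')
latin2_as_utf8_ok = decodes_as(latin2_bytes, 'utf-8')
```

A clean decode does not prove the guess is right, since single-byte encodings accept almost any input, but a UnicodeDecodeError under UTF-8 is a reliable sign that the file is stored in some other encoding.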
