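I am implementing an application in which I read a file and then normalize its contents, but I get the error below when reading the file. Here is my attempt: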
# (methods of the Vocabulary class)
def unicodeToAscii(self,s):
    # Decompose each character (NFD), then drop combining marks (category Mn)
    return ''.join(c for c in unicodedata.normalize('NFD',s) if unicodedata.category(c)!='Mn')

def normalizeString(self,s):
    s=self.unicodeToAscii(s.lower().strip())
    s=re.sub(r"([.!?])",r" \1",s)
    s=re.sub(r"([^a-zA-Z.!?])",r" ",s)
    s=re.sub(r"(\s+)",r" ",s).strip()
    return s

dataFile=os.path.join('/home/amit/Downloads/cornell_movie_dialogs_corpus/cornell movie-dialogs corpus','formatted_movie_lines')
print('please wait .. reading a file')
lines =open(dataFile).read().strip().split('\n')
vocal=Vocabulary()
pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
print('done reading')
Error:
please wait .. reading a file
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-4142a7dbef84> in <module>()
118 lines =open(dataFile).read().strip().split('\n')
119 vocal=Vocabulary()
--> 120 pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
121 print('done reading')
122
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 28: ordinal not in range(128)
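The Unicode normalization you are performing does not convert everything to ASCII. It only applies Unicode normalization, which ensures that variant representations of the same text all end up encoded in the same way. (Also, you are filtering out the characters in the Mn category, so the normalization is not complete either.) For what it's worth, U+00AD is a soft hyphen which, like the vast majority of Unicode characters, has no corresponding pure-ASCII character, although you could approximate it with a regular dash/minus/hyphen, '-'. The built-in 'replace' error handler, by contrast, will substitute a question mark for it.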
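The points above can be sketched in a few lines. This is only an illustration, not the asker's code: it is written so that it runs on Python 3, and the names strip_marks and sample, as well as the sample string itself, are made up for the example.

```python
import unicodedata

SOFT_HYPHEN = u"\u00ad"  # U+00AD, category Cf (format), not Mn

def strip_marks(s):
    """NFD-normalize and drop combining marks (category Mn), as in the question."""
    return u"".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )

# Hypothetical sample text: an accented letter plus a soft hyphen.
sample = u"caf\u00e9 co\u00adoperate"

normalized = strip_marks(sample)
# The accent is stripped, but the soft hyphen survives normalization.
print(SOFT_HYPHEN in normalized)

# errors="replace" substitutes "?" for anything ASCII cannot represent.
replaced = normalized.encode("ascii", "replace").decode("ascii")
print(replaced)

# Alternative: approximate the soft hyphen with a plain "-" before encoding.
approximated = (
    normalized.replace(SOFT_HYPHEN, u"-")
    .encode("ascii", "replace")
    .decode("ascii")
)
print(approximated)
```

Whether a question mark, a plain hyphen, or simply deleting the character is the right substitute depends on what the downstream code expects. In the question's pipeline, the later regular expression turns every non-letter into a space anyway.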