UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>

def main():
    path = r"D drive where images are stored"
    fullTempPath = r"D drive where extracted texts are stored in xls file"
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        text = pytesseract.image_to_string(img, lang="eng")
        file1 = open(fullTempPath, "a+")
        file1.write(imageName+"\n")
        file1.write(text+"\n")
        file1.close()
    file2 = open(fullTempPath, 'r')
    file2.close()
if __name__ == '__main__':
    main()

UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-fb69795bce29> in <module>
     13     file2.close()
     14 if __name__ == '__main__':
---> 15     main()

<ipython-input-7-fb69795bce29> in main()
      8         file1 = open(fullTempPath, "a+")
      9         file1.write(imageName+"\n")
---> 10         file1.write(text+"\n")
     11         file1.close()
     12     file2 = open(fullTempPath, 'r')

~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>

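3 Answers

User

Answer 1 · edited 2024-09-30 20:30:19

I don't know why Tesseract is returning a string that contains invalid Unicode characters, but that appears to be what is happening. You can tell Python to ignore encoding errors. Try changing the line that opens the output file to the following: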

file1 = open(fullTempPath, "a+", errors="ignore")
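
Keep in mind that errors="ignore" silently drops every character the target encoding cannot represent, so the saved text may be missing letters. A minimal sketch of that effect, assuming the file encoding is cp1252 as the traceback indicates; the sample string is made up for illustration:

```python
# Hypothetical sample text containing U+FB01 (the "fi" ligature).
sample = "con\ufb01g"

# With the default errors="strict", encoding to cp1252 raises UnicodeEncodeError.
try:
    sample.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="ignore" drops the unencodable character instead of raising.
ignored = sample.encode("cp1252", errors="ignore")

# errors="replace" substitutes "?" so the loss is at least visible.
replaced = sample.encode("cp1252", errors="replace")

print(strict_failed, ignored, replaced)
```

If losing characters is not acceptable, an encoding that can represent them is the safer choice.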

User

Answer 2 · edited 2024-09-30 20:30:19

Try decoding the text:

text = 'unicode error on this text'
text = text.decode('utf-8')
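
For context, the traceback appears to come from a Python 3 environment (anaconda3), where text is already a str. In Python 3 only bytes objects have a decode method, so the call above would fail there. A small sketch of the distinction, using a hypothetical sample value:

```python
# Hypothetical str value standing in for the OCR result.
sample = "\ufb01le"

# In Python 3, str has no .decode(); only bytes do.
has_decode = hasattr(sample, "decode")

# Encoding goes str -> bytes; decoding goes bytes -> str.
encoded = sample.encode("utf-8")
roundtrip = encoded.decode("utf-8")

print(has_decode, encoded, roundtrip == sample)
```

The error in the question is raised while encoding (writing str to a file), so the relevant setting is the encoding used for the output file.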

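User

Answer 3 · edited 2024-09-30 20:30:19

The default file encoding used by open is the value returned by locale.getpreferredencoding(False). On Windows, that is typically a legacy encoding that does not support all Unicode characters. In this case, the error message shows it is cp1252 (also known as Windows-1252). It is best to specify the desired encoding explicitly. UTF-8 handles all Unicode characters: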

file1 = open(fullTempPath, "a+", encoding='utf8')
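
A minimal sketch of the write step with an explicit encoding. The image name, the text, and the temporary output path are hypothetical stand-ins so the example runs without Tesseract or the original files:

```python
import locale
import os
import tempfile

# What open() falls back to when no encoding is given; often cp1252 on Windows.
print(locale.getpreferredencoding(False))

# Hypothetical stand-ins for the image name and the OCR result.
image_name = "scan1.png"
text = "of\ufb01ce"

# Temporary file used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt")

# An explicit encoding makes the result independent of the locale; the with
# block closes the file even if a write fails.
with open(out_path, "a+", encoding="utf-8") as out_file:
    out_file.write(image_name + "\n")
    out_file.write(text + "\n")

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as in_file:
    contents = in_file.read()

print(contents == "scan1.png\nof\ufb01ce\n")
```

Reading the file back should use the same encoding, otherwise the ligature may be decoded as several unrelated characters.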

FYI, U+FB01 is LATIN SMALL LIGATURE FI (ﬁ), if that makes sense for the images being processed.
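
The character can be checked with the standard unicodedata module. If plain letters are preferred over ligatures in the saved text, one option (a suggestion, not part of the answer above) is NFKC normalization, which expands the ligature into "f" and "i". The sample word is hypothetical:

```python
import unicodedata

ligature = "\ufb01"

# Look up the official Unicode name of the character.
name = unicodedata.name(ligature)

# NFKC compatibility normalization expands the ligature into "f" + "i".
expanded = unicodedata.normalize("NFKC", "of\ufb01ce")

print(name)      # LATIN SMALL LIGATURE FI
print(expanded)  # office
```

Normalization changes the text, so it is only appropriate when the exact characters produced by the OCR step do not need to be preserved.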

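Also, Windows editors tend to assume the same legacy encoding unless the file is encoded as utf-8-sig, which adds an encoded BOM character to the start of the file as a hint that the encoding is UTF-8.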
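
A short sketch of what utf-8-sig does, assuming a fresh file written in "w" mode; the path and sample text are hypothetical:

```python
import codecs
import os
import tempfile

# Temporary file used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt")

# utf-8-sig writes the UTF-8 BOM before the first character of a new file.
with open(out_path, "w", encoding="utf-8-sig") as out_file:
    out_file.write("of\ufb01ce\n")

# Inspect the raw bytes to confirm the BOM is present.
with open(out_path, "rb") as raw:
    raw_bytes = raw.read()
starts_with_bom = raw_bytes.startswith(codecs.BOM_UTF8)

# Reading with utf-8-sig strips the BOM again.
with open(out_path, encoding="utf-8-sig") as in_file:
    text = in_file.read()

print(starts_with_bom, text == "of\ufb01ce\n")
```

Programs that read the file as plain utf-8 will see the BOM as an extra leading character, so utf-8-sig is mainly useful when the file will be opened in Windows editors or spreadsheet tools.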
