Changing the encoding of text files with Python: it seems impossible

import sys
import os
import io

#from chardet.universaldetector import UniversalDetector

BLOCKSIZE = 1048576

#detector = UniversalDetector()

#def get_encoding( current_file ):
#    detector.reset()
#    for line in file(current_file):
#        detector.feed(line)
#        if detector.done: break
#    detector.close()
#    return detector.result['encoding']

def main():
    src_dir = ""
    if len( sys.argv ) > 1:
        src_dir = sys.argv[1]

    if os.path.exists( src_dir ):
        dest_dir = src_dir[:-2]
        for file in os.listdir( src_dir ):
            with io.open( os.path.join( src_dir, file ), "r", encoding='cp1252') as source_file:
                with io.open( os.path.join( dest_dir, file ), "w", encoding='utf8') as target_file:
                    while True:
                        contents = source_file.read( BLOCKSIZE )
                        if not contents:
                            break
                        target_file.write( contents )
            #print( "Encoding of " + file + ": " + get_encoding( os.path.join( dest_dir, file ) ) )
    else:
        print( 'The specified directory does not exist.' )

if __name__ == "__main__":
    main()

1 answer

User

#1 · Posted on 2024-10-01 13:28:30

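ASCII is a common subset of many encodings. It is a subset of UTF-8, Latin-1 and cp1252, and of the whole ISO-8859 family, which includes encodings for Russian, Greek and so on. If your files really are pure ASCII, there is nothing to convert: your system only reports "cp1252" because the files are compatible with it. You could add a BOM to mark a file as UTF-8 (encoding utf-8-sig), but frankly I don't see the point. UTF-8 doesn't really need one, because a UTF-8 file can be recognized by the structure of its multi-byte characters.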
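A quick way to convince yourself of this is to encode a pure-ASCII string under several encodings and compare the bytes. The sketch below uses a sample string of my own choosing (it is not taken from your files) and also shows what utf-8-sig adds:

```python
# Sample text is an assumption for illustration; any pure-ASCII string behaves the same.
ascii_text = "The specified directory does not exist."

# Encode the same text under several ASCII-compatible encodings.
encoded = {
    name: ascii_text.encode(name)
    for name in ("ascii", "cp1252", "latin-1", "utf-8")
}

# All four byte strings are identical, so "converting" changes nothing on disk.
all_identical = len(set(encoded.values())) == 1
print(all_identical)  # True

# utf-8-sig prepends the three-byte BOM; the remaining bytes are unchanged.
with_bom = ascii_text.encode("utf-8-sig")
print(with_bom[:3])  # b'\xef\xbb\xbf'
```

So for ASCII-only input, the only difference you could ever observe is the optional BOM.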

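If you want to experiment with encodings, use text that contains non-ASCII characters: French, Russian, Chinese, or even English words with accents (or the silly curly quotes that Microsoft applications like to insert). Save "Wikipédia en français" in a file and repeat your experiment, and you will get very different results.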
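Here is roughly what that experiment looks like in memory, as a minimal sketch. The variable names are mine, and it only uses the built-in str.encode and bytes.decode:

```python
text = "Wikipédia en français"

as_cp1252 = text.encode("cp1252")
as_utf8 = text.encode("utf-8")

# cp1252 uses one byte per character; UTF-8 needs two bytes each for "é" and "ç".
print(len(as_cp1252), len(as_utf8))  # 21 23
print(as_cp1252 == as_utf8)          # False

# Reading UTF-8 bytes as cp1252 does not fail, it silently produces mojibake.
mojibake = as_utf8.decode("cp1252")
print(mojibake)  # WikipÃ©dia en franÃ§ais

# Reading cp1252 bytes as UTF-8 fails, because 0xE9 is not followed by
# valid continuation bytes.
try:
    as_cp1252.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```

With non-ASCII text the two encodings produce different bytes, so a conversion actually has something to do.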

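I strongly recommend using Python 3 for this, and for anything else that involves character encodings. Python 2's approach to encodings caused a great deal of pointless confusion, and it was in fact one of the main reasons for breaking compatibility and introducing Python 3. As a bonus, in Python 3 you simply use the built-in open() with the encoding argument. You don't need any module to change an encoding.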
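As a rough sketch of what that could look like in Python 3: the helper name convert_encoding, its defaults and the temporary file names below are my own assumptions for illustration, not part of your script. It reads in blocks the same way your loop does, and passes newline="" so line endings are copied through unchanged:

```python
import os
import tempfile


def convert_encoding(src_path, dest_path,
                     src_encoding="cp1252", dest_encoding="utf-8",
                     blocksize=1048576):
    """Hypothetical helper: re-encode src_path into dest_path (must be a different file)."""
    with open(src_path, "r", encoding=src_encoding, newline="") as source_file, \
         open(dest_path, "w", encoding=dest_encoding, newline="") as target_file:
        while True:
            contents = source_file.read(blocksize)
            if not contents:
                break
            target_file.write(contents)


# Small demonstration in a temporary directory (file names are made up).
sample = "Wikipédia en français"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample_cp1252.txt")
    dest = os.path.join(tmp, "sample_utf8.txt")

    # Write the sample as raw cp1252 bytes to simulate a legacy file.
    with open(src, "wb") as f:
        f.write(sample.encode("cp1252"))

    convert_encoding(src, dest)

    # Read the result back as raw bytes to check what is really on disk.
    with open(dest, "rb") as f:
        converted = f.read()

print(converted == sample.encode("utf-8"))  # True
```

Checking the raw bytes, rather than asking a detector to guess, is the most reliable way to confirm that the conversion happened.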
