How to efficiently slice a UTF-8 encoded file

import codecs

def readfiles(filepaf):
    # Read the whole file as UTF-8 and collapse runs of whitespace
    with codecs.open(filepaf, 'r', 'utf-8') as f:
        g = f.read()
    q = ' '.join(g.split())
    return q

q=readfiles(c:xxx)
q=Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...

>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'
>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'

1 answer

User

#1 · Posted on 2024-06-25 05:43:59

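Discard bytes from the start of the slice whose leading bits are 10 (UTF-8 continuation bytes) until you reach a byte that does not start with 10. That byte begins a new character. You will have to skip at most 3 bytes.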
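The skipping rule above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper name `skip_to_char_start` and the sample string are made up for the example, and it assumes the slice was taken from valid UTF-8 bytes.

```python
def skip_to_char_start(chunk):
    """Drop leading UTF-8 continuation bytes (10xxxxxx) from a byte slice.

    Hypothetical helper, assuming `chunk` was cut out of valid UTF-8 data.
    """
    data = bytearray(chunk)  # bytearray yields ints on both Python 2 and 3
    start = 0
    # A continuation byte has its top two bits equal to 10: (b & 0xC0) == 0x80.
    # A character has at most 3 continuation bytes, so skip at most 3.
    while start < len(data) and start < 3 and (data[start] & 0xC0) == 0x80:
        start += 1
    return bytes(data[start:])


# Sample text: the euro sign U+20AC takes 3 bytes (E2 82 AC) in UTF-8.
encoded = u"caf\u00e9 \u20ac10".encode("utf-8")

# Byte offset 7 falls in the middle of the euro sign.
tail = skip_to_char_start(encoded[7:])
text = tail.decode("utf-8")  # decodes cleanly once the stray bytes are gone
```

The end of a slice can also cut a character in half; this sketch handles only the leading side, which is the case the answer describes.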

Alternatively, avoid slicing the encoded byte string in the first place.

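Note that \ufeff is a valid character: it is the zero-width no-break space, which some broken text editors insert at the start of UTF-8 files to mark them as such. If you want to skip it, use the utf-8-sig encoding.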
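As a small sketch of the difference between the two codecs, the snippet below writes a temporary file that starts with the BOM bytes and reads it back both ways. The file name is hypothetical, and `io.open` is used here so the same code runs on Python 2 and 3; the question's `codecs.open` accepts the same encoding name.

```python
import io
import os
import tempfile

# Create a small file that begins with the UTF-8 BOM bytes EF BB BF.
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")  # hypothetical name
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfKatharine opened her lips")

# Plain 'utf-8' keeps the BOM as the first character, u'\ufeff'.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# 'utf-8-sig' strips a leading BOM if one is present.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()
```

With `utf-8-sig`, `q[0:100]` in the question would start at the first letter of the text instead of at `\ufeff`.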
