Python 3.4: removing or ignoring emoji when writing to a file

Posted 2024-10-03 02:46:29


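I am trying to parse an XML file and write its contents to a plain text file. The program works until it hits an emoji, at which point Python throws the following error: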

UnicodeEncodeError: 'charmap' codec can't encode characters in position 177-181: character maps to <undefined>

I went to the position reported in the error and found the following emoji in the XML file:

emoji

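My question is: when writing to the file, how can I either encode these characters as Unicode or remove/ignore them entirely?

When I print() the data to the console it comes out perfectly, but writing it to a file raises an error.

I have searched Google and this site, but the only answer I keep finding is that the characters are already encoded as Unicode. As you can see, what I have are the literal characters. I am not sure I am using the right terminology.

For reference, the XML file I am working with has the following format: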

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<?xml-stylesheet type="text/xsl" href="sms.xsl"?>
<smses count="1">
  <sms protocol="0" address="+00000000000" date="1346772606199" type="1" subject="null" body="Lorem ipsum dolor sit amet, consectetur adipisicing elit," toa="null" sc_toa="null" service_center="+00000000000" read="1" status="-1" locked="0" date_sent="1346772343000" readable_date="Sep 4, 2012 10:30:06 AM" contact_name="John Doe" />
</smses>

2 Answers

You have two options:

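  1. Pick an encoding that can handle emoji code points. You opened the file for writing either with the default codec (which depends on your system) or with an explicit encoding that does not support those code points.

    The UTF encodings handle these code points just fine; I picked UTF-8 here: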

    with open(filename, 'w', encoding='utf8') as outfile:
        outfile.write(yourdata)
    
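  2. Set an error-handling mode that replaces code points the codec cannot handle with a replacement character or an escape sequence, or ignores them entirely. See the errors parameter of the open() function: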

    errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:

    • 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
    • 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
    • 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
    • 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
    • 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
    • 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.

    So opening the file with errors='ignore' will simply skip the emoji code points when writing, instead of raising an error:

    with open(filename, 'w', errors='ignore') as outfile:
        outfile.write(yourdata)
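
    To see how the handlers listed above differ when writing a real file, here is a minimal sketch. The sample string, the helper name write_and_read_back, and the choice of 'ascii' as the restrictive encoding are all assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile

# Illustrative sample text containing one emoji (U+1F44C OK HAND SIGN).
text = 'OK \U0001F44C done'


def write_and_read_back(data, encoding, errors):
    """Write data to a temporary text file and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix='.txt')
    os.close(fd)
    try:
        # The errors argument decides what happens to unencodable characters.
        with open(path, 'w', encoding=encoding, errors=errors) as outfile:
            outfile.write(data)
        with open(path, 'rb') as infile:
            return infile.read()
    finally:
        os.remove(path)


# 'ascii' stands in for any encoding that cannot represent the emoji.
results = {
    handler: write_and_read_back(text, 'ascii', handler)
    for handler in ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')
}
for handler, raw in results.items():
    print(handler, raw)
```

    With 'ignore' the emoji disappears without a trace, so if you need to recover the original text later, 'xmlcharrefreplace' or 'backslashreplace' is probably the safer choice.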
    

Demo:

>>> a_ok = 'The U+1F44C OK HAND SIGN codepoint: \U0001F44C'
>>> print(a_ok)
The U+1F44C OK HAND SIGN codepoint: 👌
>>> a_ok.encode('utf8')
b'The U+1F44C OK HAND SIGN codepoint: \xf0\x9f\x91\x8c'
>>> a_ok.encode('cp1251', errors='ignore')
b'The U+1F44C OK HAND SIGN codepoint: '
>>> a_ok.encode('cp1251', errors='replace')
b'The U+1F44C OK HAND SIGN codepoint: ?'
>>> a_ok.encode('cp1251', errors='xmlcharrefreplace')
b'The U+1F44C OK HAND SIGN codepoint: &#128076;'
>>> a_ok.encode('cp1251', errors='backslashreplace')
b'The U+1F44C OK HAND SIGN codepoint: \\U0001f44c'

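Note that the 'surrogateescape' option has only limited room and is really only useful when decoding files in an unknown encoding; it cannot handle emoji in any case.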
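
Applied to the kind of XML shown in the question, the first option might look like the sketch below. It assumes the file is parsed with xml.etree.ElementTree and that the text of interest is in the body attribute of each sms element; the shortened sample document, the output file name messages.txt, and the "name: body" line format are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the question's XML, with an emoji in body.
xml_source = (
    "<smses count='1'>"
    "<sms address='+00000000000' body='Lorem ipsum \U0001F44C' "
    "contact_name='John Doe' />"
    "</smses>"
)

root = ET.fromstring(xml_source)

# Build one plain-text line per message from its attributes.
lines = [
    '{}: {}'.format(sms.get('contact_name'), sms.get('body'))
    for sms in root.iter('sms')
]

# An explicit UTF-8 encoding lets the emoji be written regardless of the
# system default codec.
out_path = os.path.join(tempfile.mkdtemp(), 'messages.txt')
with open(out_path, 'w', encoding='utf-8') as outfile:
    outfile.write('\n'.join(lines) + '\n')

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding='utf-8') as infile:
    round_tripped = infile.read()
print(round_tripped == 'John Doe: Lorem ipsum \U0001F44C\n')
```

Whatever later reads the file has to open it as UTF-8 as well, otherwise the emoji will show up as garbage characters.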

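(Edit: this answer applies to Python 2.x, not Python 3.x.)

Right now you are writing a unicode string to the file using the default encoding, which does not support emoji (or, for that matter, a lot of other characters you may well need). You can write with the UTF-8 encoding instead, which supports every Unicode character.

Instead of file.write( data ), try file.write( data.encode("utf-8") )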
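
For Python 3 readers, a rough equivalent of this answer is sketched below. In Python 3 a file opened in text mode only accepts str, so bytes you encode yourself have to go to a file opened in binary mode ('wb'). The sample string and the file name out.bin are assumptions for illustration. The sketch also prints the locale's preferred encoding, which is what open() falls back to in the Python versions discussed here when no encoding is given.

```python
import locale
import os
import tempfile

# Illustrative data containing one emoji (U+1F44C OK HAND SIGN).
data = 'OK \U0001F44C'

# The encoding a text-mode open() would use by default on this system.
default_encoding = locale.getpreferredencoding(False)
print('default encoding:', default_encoding)

# Encode explicitly and write the resulting bytes in binary mode.
out_path = os.path.join(tempfile.mkdtemp(), 'out.bin')
with open(out_path, 'wb') as outfile:
    outfile.write(data.encode('utf-8'))

with open(out_path, 'rb') as infile:
    raw = infile.read()
print(raw)
```

In most Python 3 code, passing encoding='utf-8' to open() in text mode is simpler than encoding by hand.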
