Python 3.4: strip or ignore emoji when writing to a file

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?> <?xml-stylesheet type="text/xsl" href="sms.xsl"?> <smses count="1"> <sms protocol="0" address="+00000000000" date="1346772606199" type="1" subject="null" body="Lorem ipsum dolor sit amet, consectetur adipisicing elit," toa="null" sc_toa="null" service_center="+00000000000" read="1" status="-1" locked="0" date_sent="1346772343000" readable_date="Sep 4, 2012 10:30:06 AM" contact_name="John Doe" /> </smses>

2 answers

User

Answer 1 · edited 2024-10-03 02:46:29

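You have two options:

First, pick an encoding that can handle the emoji code points. You opened the file for writing either with the default codec (which depends on your system) or with an explicit encoding that does not support these code points.
The UTF encodings handle these code points just fine; I picked UTF-8 here: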
```
with open(filename, 'w', encoding='utf8') as outfile:
    outfile.write(yourdata)
```
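As a quick check of this option, the sketch below writes a string containing an emoji to a UTF-8 file and reads it back. The temporary file and the `sample` text are assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile

sample = 'OK hand: \U0001F44C'  # hypothetical sample text with one emoji

# Create a throwaway file so the example does not touch real data.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # UTF-8 can encode every Unicode code point, including emoji.
    with open(path, 'w', encoding='utf8') as outfile:
        outfile.write(sample)
    # Read the text back with the same encoding.
    with open(path, 'r', encoding='utf8') as infile:
        round_tripped = infile.read()
    # Also look at the raw bytes that ended up on disk.
    with open(path, 'rb') as raw:
        raw_bytes = raw.read()
finally:
    os.remove(path)
```

The emoji survives the round trip, and on disk it is stored as the four bytes `\xf0\x9f\x91\x8c`.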
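Alternatively, set an error-handling mode that replaces code points the codec cannot handle with a replacement character or an escape sequence, or ignores them entirely. See the errors parameter of the open() function: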
errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. A variety of standard error handlers are available, though any error handling name that has been registered with codecs.register_error() is also valid. The standard names are:
- 'strict' to raise a ValueError exception if there is an encoding error. The default value of None has the same effect.
- 'ignore' ignores errors. Note that ignoring encoding errors can lead to data loss.
- 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data.
- 'surrogateescape' will represent any incorrect bytes as code points in the Unicode Private Use Area ranging from U+DC80 to U+DCFF. These private code points will then be turned back into the same bytes when the surrogateescape error handler is used when writing data. This is useful for processing files in an unknown encoding.
- 'xmlcharrefreplace' is only supported when writing to a file. Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;.
- 'backslashreplace' (also only supported when writing) replaces unsupported characters with Python’s backslashed escape sequences.
So opening the file with errors='ignore' writes the text without the emoji code points instead of raising an error:
```
with open(filename, 'w', errors='ignore') as outfile:
    outfile.write(yourdata)
```
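Note that the error handler only comes into play when the chosen encoding cannot represent a character; if the system default happens to be UTF-8, the emoji is written unchanged. The sketch below therefore passes an explicit `encoding='ascii'` so the effect is visible. The helper name `write_with_errors` and the sample string are hypothetical, introduced only for this illustration.

```python
import os
import tempfile


def write_with_errors(text, encoding, errors):
    """Write text to a temporary file and return the raw bytes written."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w', encoding=encoding, errors=errors) as outfile:
            outfile.write(text)
        with open(path, 'rb') as raw:
            return raw.read()
    finally:
        os.remove(path)


sample = 'body="hi \U0001F44C"'  # hypothetical attribute text with an emoji

# 'ignore' silently drops the code point ASCII cannot encode.
ignored = write_with_errors(sample, 'ascii', 'ignore')

# 'xmlcharrefreplace' keeps the information as an XML character reference,
# which may suit XML output like the SMS backup in the question.
escaped = write_with_errors(sample, 'ascii', 'xmlcharrefreplace')
```

With `'ignore'` the emoji is lost for good, while `'xmlcharrefreplace'` lets an XML parser recover it later.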

Demo:

>>> a_ok = 'The U+1F44C OK HAND SIGN codepoint: \U0001F44C'
>>> print(a_ok)
The U+1F44C OK HAND SIGN codepoint: 👌
>>> a_ok.encode('utf8')
b'The U+1F44C OK HAND SIGN codepoint: \xf0\x9f\x91\x8c'
>>> a_ok.encode('cp1251', errors='ignore')
b'The U+1F44C OK HAND SIGN codepoint: '
>>> a_ok.encode('cp1251', errors='replace')
b'The U+1F44C OK HAND SIGN codepoint: ?'
>>> a_ok.encode('cp1251', errors='xmlcharrefreplace')
b'The U+1F44C OK HAND SIGN codepoint: &#128076;'
>>> a_ok.encode('cp1251', errors='backslashreplace')
b'The U+1F44C OK HAND SIGN codepoint: \\U0001f44c'

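Note that the 'surrogateescape' option has limited space and is only really useful when decoding a file with an unknown encoding; it cannot handle emoji in any case.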
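To illustrate that limitation, here is a minimal sketch with made-up byte strings: 'surrogateescape' round-trips an undecodable byte, but it does not help when encoding an emoji to a codec that lacks it.

```python
# A byte that is never valid in UTF-8 (hypothetical input data).
raw = b'abc\xff'

# Decoding maps the bad byte to a lone surrogate in U+DC80..U+DCFF.
text = raw.decode('utf8', errors='surrogateescape')

# Encoding with the same handler turns the surrogate back into the byte.
restored = text.encode('utf8', errors='surrogateescape')

# An emoji is outside that surrogate range, so the handler cannot
# rescue it when the target codec does not support it.
try:
    '\U0001F44C'.encode('ascii', errors='surrogateescape')
    emoji_handled = True
except UnicodeEncodeError:
    emoji_handled = False
```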

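User

Answer 2 · edited 2024-10-03 02:46:29

(Edit: this answer applies to Python 2.x, not Python 3.x.)

Right now you are writing the unicode string to the file with the default encoding, which does not support emoji (or, for that matter, a lot of characters you probably do want). You can write with the UTF-8 encoding instead, which supports all Unicode characters.

Instead of file.write( data ), try file.write( data.encode("utf-8") ).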
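Under Python 3, a file opened in text mode rejects bytes, so the same idea would need binary mode. A minimal sketch, assuming a temporary file and a hypothetical `data` string:

```python
import os
import tempfile

data = 'Hello \U0001F44C'  # hypothetical text containing an emoji

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # In Python 3, a text-mode file only accepts str, not bytes.
    try:
        with open(path, 'w', encoding='utf-8') as outfile:
            outfile.write(data.encode('utf-8'))
        text_mode_accepts_bytes = True
    except TypeError:
        text_mode_accepts_bytes = False

    # Binary mode accepts bytes, so the manual encode works here.
    with open(path, 'wb') as outfile:
        outfile.write(data.encode('utf-8'))

    with open(path, 'rb') as infile:
        written = infile.read()
finally:
    os.remove(path)
```

Passing `encoding='utf8'` to `open()` in text mode, as in the first answer, is usually the simpler choice on Python 3.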
