How can I preserve ASCII hexadecimal code points when writing an ElementTree in Python?

Posted 2024-06-13 16:49:21


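I loaded an XML file (Rhythmbox's database file) into Python 3 with the ElementTree parser. After modifying the tree and writing it to disk with ElementTree.write() using the ascii encoding, every character reference that was written in hexadecimal has been converted to its decimal form. For example, here is a diff showing the copyright symbol: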

<     <copyright>&#xA9; WNYC</copyright>
---
>     <copyright>&#169; WNYC</copyright>

Is there a way to tell Python/ElementTree not to do this? I would like every hexadecimal character reference to stay hexadecimal.
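
For reference, the behavior can be reproduced with a minimal sketch. The one-element document below is a hypothetical stand-in for the Rhythmbox database, and an in-memory buffer is assumed in place of a file on disk. The parser decodes the reference into the character itself, so the original hexadecimal spelling is no longer available when the tree is written back out.

```python
import io
from xml.etree import ElementTree

# Hypothetical one-element document standing in for the Rhythmbox database.
source = "<copyright>&#xA9; WNYC</copyright>"
root = ElementTree.fromstring(source)

# The parser has already turned the reference into the character itself,
# so nothing records that it was originally written in hexadecimal.
print(repr(root.text))

# Write to an in-memory binary buffer instead of a file on disk.
buffer = io.BytesIO()
ElementTree.ElementTree(root).write(buffer, encoding="us-ascii")
output = buffer.getvalue()

# The non-ASCII character comes back as a decimal character reference.
print(output)
```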


Tags: file, code, database, parser, encoding, ascii, xml, disk
1 answer
User
#1 · Posted 2024-06-13 16:49:21

I found a solution. First I created a new codec error handler, then I monkey-patched ElementTree. It looks like this:

from xml.etree import ElementTree
import io
import contextlib
import codecs


def html_replace(exc):
    # Replace each unencodable character with an uppercase hexadecimal
    # character reference, e.g. U+00A9 -> &#xA9;
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            s.append('&#x%X;' % ord(c))
        return ''.join(s), exc.end
    else:
        # exc is an instance, so the class name lives on type(exc)
        raise TypeError("can't handle %s" % type(exc).__name__)

codecs.register_error('html_replace', html_replace)


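# Monkey-patch ElementTree's private _get_writer helper so that it uses
# html_replace instead of xmlcharrefreplace. This copy mirrors one Python 3
# version of the helper; other versions may differ, so compare it with your
# installed xml/etree/ElementTree.py before relying on it.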
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
    # returns text write method and release all resources after using
    try:
        write = file_or_filename.write
    except AttributeError:
        # file_or_filename is a file name
        if encoding == "unicode":
            file = open(file_or_filename, "w")
        else:
            file = open(file_or_filename, "w", encoding=encoding,
                        errors="html_replace")
        with file:
            yield file.write
    else:
        # file_or_filename is a file-like object
        # encoding determines if it is a text or binary writer
        if encoding == "unicode":
            # use a text writer as is
            yield write
        else:
            # wrap a binary writer with TextIOWrapper
            with contextlib.ExitStack() as stack:
                if isinstance(file_or_filename, io.BufferedIOBase):
                    file = file_or_filename
                elif isinstance(file_or_filename, io.RawIOBase):
                    file = io.BufferedWriter(file_or_filename)
                    # Keep the original file open when the BufferedWriter is
                    # destroyed
                    stack.callback(file.detach)
                else:
                    # This is to handle passed objects that aren't in the
                    # IOBase hierarchy, but just have a write method
                    file = io.BufferedIOBase()
                    file.writable = lambda: True
                    file.write = write
                    try:
                        # TextIOWrapper uses these methods to determine
                        # if BOM (for UTF-16, etc) should be added
                        file.seekable = file_or_filename.seekable
                        file.tell = file_or_filename.tell
                    except AttributeError:
                        pass
                file = io.TextIOWrapper(file,
                                        encoding=encoding,
                                        errors='html_replace',
                                        newline="\n")
                # Keep the original file open when the TextIOWrapper is
                # destroyed
                stack.callback(file.detach)
                yield file.write

ElementTree._get_writer = _get_writer
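
The error handler can also be used without patching ElementTree's internals. The following is a minimal sketch of that alternative, not the approach above: it serializes the tree to a str with ElementTree.tostring(encoding="unicode") and then encodes the result with the custom handler. The handler name "hex_charref" and the sample document are assumptions made for illustration.

```python
import codecs
from xml.etree import ElementTree


def hex_charref_replace(exc):
    """Replace unencodable characters with uppercase hex references."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        bad = exc.object[exc.start:exc.end]
        return "".join("&#x%X;" % ord(c) for c in bad), exc.end
    raise TypeError("can't handle %s" % type(exc).__name__)


# "hex_charref" is a hypothetical handler name chosen for this sketch.
codecs.register_error("hex_charref", hex_charref_replace)

# Hypothetical sample document; a real tree would come from ElementTree.parse().
root = ElementTree.fromstring("<copyright>&#xA9; WNYC</copyright>")

# Serialize to str first (no character references are produced at this
# stage), then let the custom handler deal with anything outside ASCII.
text = ElementTree.tostring(root, encoding="unicode")
data = text.encode("ascii", errors="hex_charref")
print(data)
```

The resulting bytes can be written with open(path, "wb"). Two caveats apply: tostring() with encoding="unicode" emits no XML declaration, so one has to be prepended by hand if it is needed, and every non-ASCII character is written as a hexadecimal reference regardless of how it was spelled in the input file.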
