What should I replace xml.dom.minidom with to get something that pickles efficiently?

Posted 2024-10-02 22:27:28

I have an application that processes XML files of about 1-2 MB. That doesn't sound like much, but I've still run into performance problems.

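Because I have some compute-bound tasks that I want to speed up, I tried using multiprocessing's imap, which requires pickling the XML data. Pickling the data structures that contain references into this DOM turns out to be slower than the compute-bound work itself, and the culprit seems to be recursion: to get pickle to work at all, I had to set the recursion limit to 10,000.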
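One way to sidestep pickling the node graph, sketched here under assumptions rather than taken from the question, is to hand each worker the XML as text and let it reparse locally. The function name `count_elements` and the sample document are hypothetical; whether reparsing is cheaper than pickling the nodes for the real files would need to be measured.

```python
import pickle
from xml.dom import minidom


def count_elements(xml_text, tag_name):
    """Worker-side function: parse the XML text locally, then compute on it."""
    doc = minidom.parseString(xml_text)
    try:
        return len(doc.getElementsByTagName(tag_name))
    finally:
        doc.unlink()  # break minidom's internal reference cycles


# Hypothetical sample document standing in for the real 1-2 MB file.
source_doc = minidom.parseString(
    "<root>" + "".join('<item id="%d"/>' % i for i in range(1000)) + "</root>"
)

# Serialize once in the parent process; the payload is an ordinary string.
xml_text = source_doc.documentElement.toxml()

# A string round-trips through pickle without walking any object graph.
restored_text = pickle.loads(pickle.dumps(xml_text))
item_count = count_elements(restored_text, "item")
```

With a process pool, the same idea would mean passing the `toxml()` text of each subtree as the task argument instead of the node itself, at the cost of one reparse per task.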

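Anyway, my question is:

If I want to tackle this from the performance angle, what should I use instead of minidom? The criteria are both pickling performance and ease of transition.

To give you an idea of which methods are needed, I've pasted a wrapper class below (written some time ago to speed up getElementsByTagName calls). It would be acceptable to replace all minidom nodes with nodes that follow the same interface as this class; in other words, I don't need every method in minidom. Getting rid of the parentNode method is also acceptable (and is probably a good idea for pickling performance).

Yes, if I were designing this today I wouldn't use XML node references in the first place, but ripping all of that out now would be a lot of work, so I'm hoping this can be patched instead.

Should I just write the damn thing myself using Python built-ins or the collections library?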

class ImmutableDOMNode(object):
    def __init__(self, node):
        self.node = node
        self.cachedElementsByTagName = {}

    @property
    def nodeType(self):
        return self.node.nodeType

    @property
    def tagName(self):
        return self.node.tagName

    @property
    def ownerDocument(self):
        return self.node.ownerDocument

    @property
    def nodeName(self):
        return self.node.nodeName

    @property
    def nodeValue(self):
        return self.node.nodeValue

    @property
    def attributes(self):
        return self.node.attributes

    @property
    def parentNode(self):
        return ImmutableDOMNode(self.node.parentNode)

    @property
    def firstChild(self):
        return ImmutableDOMNode(self.node.firstChild)

    @property
    def childNodes(self):
        return [ImmutableDOMNode(node) for node in self.node.childNodes]

    def getElementsByTagName(self, name):
        result = self.cachedElementsByTagName.get(name)
        if result is not None:
            return result
        uncachedResult = self.node.getElementsByTagName(name)
        cachedResult = [ImmutableDOMNode(node) for node in uncachedResult]
        self.cachedElementsByTagName[name] = cachedResult
        return cachedResult

    def getAttribute(self, qName):
        return self.node.getAttribute(qName)

    def toxml(self, encoding=None):
        return self.node.toxml(encoding)

    def toprettyxml(self, indent="", newl="", encoding=None):
        return self.node.toprettyxml(indent, newl, encoding)

    def appendChild(self, node):
        raise Exception("cannot append child to immutable node")

    def removeChild(self, node):
        raise Exception("cannot remove child from immutable node")

    def cloneNode(self, deep):
        raise Exception("clone node not implemented")

    def createElement(self, tagName):
        raise Exception("cannot create element for immutable node")

    def createTextNode(self, tagName):
        raise Exception("cannot create text node for immutable node")

    def createAttribute(self, qName):
        raise Exception("cannot create attribute for immutable node")

1 answer
User
#1 · Posted 2024-10-02 22:27:28

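So I decided to just write my own DOM implementation to meet my needs. I've pasted it below in case it helps someone else. It depends on lru_cache from "memoization library for python 2.7" and on @Raymond Hettinger's immutable dict from "Immutable dictionary, only use as a key for another dictionary". These dependencies are easy to remove, though, if you don't mind reduced safety/performance.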

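import copy
import xml.dom.minidom
from collections import defaultdict
# lru_cache and ImmutableDict come from the two recipes mentioned above.

class CycleFreeDOMNode(object):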
    def __init__(self, minidomNode=None):
        if minidomNode is None:
            return
        if not isinstance(minidomNode, xml.dom.minidom.Node):
            raise ValueError("%s needs to be instantiated with a minidom.Node" %(
                type(self).__name__
            ))
        if minidomNode.nodeValue and minidomNode.childNodes:
            raise ValueError(
                "both nodeValue and childNodes in same node are not supported"
            )
        self._tagName = minidomNode.tagName \
            if hasattr(minidomNode, "tagName") else None
        self._nodeType = minidomNode.nodeType
        self._nodeName = minidomNode.nodeName
        self._nodeValue = minidomNode.nodeValue
        self._attributes = dict(
            item
            for item in minidomNode.attributes.items()
        ) if minidomNode.attributes else {}
        self._childNodes = tuple(
            CycleFreeDOMNode(cn)
            for cn in minidomNode.childNodes
        )
        childNodesByTagName = defaultdict(list)
        for cn in self._childNodes:
            childNodesByTagName[cn.tagName].append(cn)
        self._childNodesByTagName = ImmutableDict(childNodesByTagName)

    @property
    def nodeType(self):
        return self._nodeType

    @property
    def tagName(self):
        return self._tagName

    @property
    def nodeName(self):
        return self._nodeName

    @property
    def nodeValue(self):
        return self._nodeValue

    @property
    def attributes(self):
        return self._attributes

    @property
    def firstChild(self):
        return self._childNodes[0] if self._childNodes else None

    @property
    def childNodes(self):
        return self._childNodes

    @lru_cache(maxsize = 100)
    def getElementsByTagName(self, name):
        # Copy the list so that += below does not mutate the stored per-tag index
        result = list(self._childNodesByTagName.get(name, []))
        for cn in self.childNodes:
            result += cn.getElementsByTagName(name)
        return result

    def cloneNode(self, deep=False):
        clone = CycleFreeDOMNode()
        clone._tagName = self._tagName
        clone._nodeType = self._nodeType
        clone._nodeName = self._nodeName
        clone._nodeValue = self._nodeValue
        clone._attributes = copy.copy(self._attributes)
        if deep:
            clone._childNodes = tuple(
                cn.cloneNode(deep)
                for cn in self.childNodes
            )
            childNodesByTagName = defaultdict(list)
            for cn in clone._childNodes:
                childNodesByTagName[cn.tagName].append(cn)
            clone._childNodesByTagName = ImmutableDict(childNodesByTagName)
        else:
            clone._childNodes = tuple(cn for cn in self.childNodes)
            clone._childNodesByTagName = self._childNodesByTagName
        return clone

    def toxml(self):
        def makeXMLForContent():
            return self.nodeValue or "".join([
                cn.toxml() for cn in self.childNodes
            ])

        if not self.tagName:
            return makeXMLForContent()
        return "<%s%s>%s</%s>" %(
            self.tagName,
            # XML attributes are separated by whitespace, not commas;
            # note that values are not escaped here
            " " + " ".join([
                "%s=\"%s\"" %(k,v)
                for k,v in self.attributes.items()
            ]) if self.attributes else "",
            makeXMLForContent(),
            self.tagName
        )

    def getAttribute(self, name):
        return self._attributes.get(name, "")

    def setAttribute(self, name, value):
        self._attributes[name] = value
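
For readers who want to drop the two external dependencies, the following is a minimal standard-library sketch of the same cycle-free idea. It is not the answer's implementation: the class name `PlainNode` and its `from_minidom` helper are hypothetical, there is no caching and no immutability, and only a few methods of the interface are covered.

```python
import pickle
from xml.dom import minidom


class PlainNode(object):
    """Hypothetical cycle-free node: children only, no parent or sibling links."""

    def __init__(self, tagName, attributes, nodeValue, childNodes):
        self.tagName = tagName          # None for text nodes
        self.attributes = attributes    # plain dict of attribute name -> value
        self.nodeValue = nodeValue      # text for text nodes, else None
        self.childNodes = childNodes    # tuple of PlainNode

    @classmethod
    def from_minidom(cls, node):
        """Copy a minidom node and its descendants into PlainNode objects."""
        attrs = dict(node.attributes.items()) if node.attributes else {}
        children = tuple(cls.from_minidom(child) for child in node.childNodes)
        return cls(getattr(node, "tagName", None), attrs, node.nodeValue, children)

    def getElementsByTagName(self, name):
        """Return matching descendants in document (pre-order) order."""
        found = []
        for child in self.childNodes:
            if child.tagName == name:
                found.append(child)
            found.extend(child.getElementsByTagName(name))
        return found

    def getAttribute(self, name):
        return self.attributes.get(name, "")


doc = minidom.parseString('<root><a id="1"><b/></a><a id="2"/>text</root>')
tree = PlainNode.from_minidom(doc.documentElement)

# Pickle recursion depth now follows the tree depth, not the sibling count.
restored = pickle.loads(pickle.dumps(tree))
ids = [a.getAttribute("id") for a in restored.getElementsByTagName("a")]
```

Because each node only points downward to its children, the pickled object graph is a plain tree. Code that relied on parentNode would need to carry the parent separately.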
