How to get specific nodes from an XML file with Python

<AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type>
    <type> geo </type>
    <type> rig </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type>
    <type> cam2 </type>
    <type> cam4 </type>
</AssetType>

3 answers

User

Answer 1 · edited 2024-05-18 21:23:36

If you don't mind loading the whole document into memory:

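from lxml import etree
# fname is the path to the XML file (or an open file object)
data = etree.parse(fname)
# Collect the stripped text of every <type> under the "characters" AssetType
result = [node.text.strip()
          for node in data.xpath("//AssetType[@longname='characters']/type")]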

You may need to remove the whitespace at the start of the tags for this to work.
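Another cleanup the input may need: the fragment in the question has two top-level elements, so a parser will reject it unless it sits inside a single root. Below is a minimal sketch of that fix. It uses the standard-library ElementTree in place of lxml so that it runs without a third-party install, and the `<assets>` wrapper and the `parse_fragment` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question: two top-level <AssetType> elements, no root.
FRAGMENT = """
<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type> pub </type> <type> geo </type> <type> rig </type>
</AssetType>
<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type> cam1 </type> <type> cam2 </type> <type> cam4 </type>
</AssetType>
"""

def parse_fragment(text):
    """Wrap a multi-root fragment in a hypothetical <assets> root and parse it."""
    return ET.fromstring("<assets>" + text + "</assets>")

try:
    ET.fromstring(FRAGMENT)   # rejected: more than one top-level element
    parsed_unwrapped = True
except ET.ParseError:
    parsed_unwrapped = False

root = parse_fragment(FRAGMENT)
# Same selection as the XPath above, in ElementTree's path subset.
character_types = [node.text.strip()
                   for node in root.findall("./AssetType[@longname='characters']/type")]
print(parsed_unwrapped, character_types)
```

The wrapping trick assumes the fragment carries no XML declaration of its own; if it does, the declaration would have to be removed first.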

User

Answer 2 · edited 2024-05-18 21:23:36

Assuming your document is named assets.xml and has the following structure:

<assets>
    <AssetType>
        ...
    </AssetType>
    <AssetType>
        ...
    </AssetType>
</assets>

Then you can do the following:

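from xml.etree.ElementTree import ElementTree
tree = ElementTree()
root = tree.parse("assets.xml")  # parse() returns the root element
# ElementTree only accepts relative paths on an element: use ".//", not "//"
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # Iterate over the children directly; getchildren() was removed in Python 3.9
    for type_node in asset_type:
        print(type_node.text)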
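The loop above can be turned into a small reusable helper. This is only a sketch: the `types_for` name and the inline `SAMPLE` document (the question's data wrapped in the `<assets>` root this answer assumes) are illustrative, not part of the original answer. It compares the attribute in Python rather than building a path string, so a longname containing quotes cannot break the query.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's data inside the assumed <assets> root.
SAMPLE = """<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type> <type> geo </type> <type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type> <type> cam2 </type> <type> cam4 </type>
  </AssetType>
</assets>"""

def types_for(root, longname):
    """Return the stripped text of each <type> child of matching AssetType elements."""
    result = []
    for asset_type in root.iter("AssetType"):
        if asset_type.get("longname") == longname:
            # text is None for an empty element, so fall back to ""
            result.extend((t.text or "").strip() for t in asset_type.findall("type"))
    return result

sample_root = ET.fromstring(SAMPLE)
print(types_for(sample_root, "characters"))
print(types_for(sample_root, "camera"))
```

An unknown longname simply yields an empty list, which is usually easier to handle downstream than an exception.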

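User

Answer 3 · edited 2024-05-18 21:23:36

You can use the pulldom API to parse large files without loading them entirely into memory at once. It offers a more convenient interface than SAX, with only a small performance penalty.

It essentially lets you stream through the XML file until you reach the part you are interested in, and then switch to regular DOM operations from that point on.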


from xml.dom import pulldom

# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
def getInnerText(oNode):
    """Return the concatenated text of oNode and all of its descendants."""
    rc = ""
    for node in oNode.childNodes:
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            rc += node.data
        elif node.nodeType == node.ELEMENT_NODE:
            rc += getInnerText(node)  # recurse into child elements
        # other node types (comments, processing instructions, ...) add no text
    return rc


# xml_file is either a filename or an open file object
stream = pulldom.parse(xml_file)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node)  # node now contains a mini DOM tree
            type_nodes = node.getElementsByTagName("type")
            for type_node in type_nodes:
                # type_text holds the text inside the <type> element
                type_text = getInnerText(type_node)
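For a version that can be run as-is, here is a minimal sketch that feeds pulldom an in-memory string through `pulldom.parseString` instead of a file. The `stream_types` helper and the `SAMPLE` document (with an assumed `<assets>` root) are made up for illustration, and the early return assumes each longname occurs only once in the document.

```python
from xml.dom import pulldom

# Hypothetical sample: the question's data inside an assumed <assets> root.
SAMPLE = """<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type> <type> geo </type> <type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type> <type> cam2 </type> <type> cam4 </type>
  </AssetType>
</assets>"""

def stream_types(xml_text, longname):
    """Stream through xml_text and return the <type> texts of the first matching AssetType."""
    stream = pulldom.parseString(xml_text)
    for event, node in stream:
        if event == pulldom.START_ELEMENT and node.nodeName == "AssetType":
            if node.getAttribute("longname") == longname:
                stream.expandNode(node)  # build a DOM subtree for this element only
                found = []
                for type_node in node.getElementsByTagName("type"):
                    # Join the text children; the sample has no nested markup
                    text = "".join(child.data for child in type_node.childNodes
                                   if child.nodeType == child.TEXT_NODE)
                    found.append(text.strip())
                return found  # stop early: the rest of the input is never parsed
    return []

print(stream_types(SAMPLE, "characters"))
print(stream_types(SAMPLE, "camera"))
```

Returning as soon as the match is processed is what keeps the work proportional to the position of the target element rather than to the size of the file.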
