lxml (Python): reading the text and tree of an XML file with a given structure

Posted 2024-10-05 15:20:53


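I am trying to get the text and the ids under a node; see the sample file here: example.xml

However, it does not have the structure of an ordinary XML file. The structure looks like this: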

<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>

The output I want is a list of (start_node, end_node, text_between_node). I am not sure whether this can be done with the lxml library.

Currently, I am using

from lxml import etree
tree = etree.parse('9407011.az-scixml.xml')
nodes = list(tree.xpath('//TextWithNodes')[0])  # getchildren() is deprecated
node = nodes[0]  # take one node as an example
print(node.text)  # prints None, because each Node is self-closing and contains no text

1 answer
User
#1 · Posted 2024-10-05 15:20:53

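XPath may work for you. Comparing normalize-space() against the empty string filters out nodes whose following text is empty or whitespace only.

Something like this might do it: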

from lxml import etree as ET
root = ET.XML(b'''<?xml version='1.0' encoding='UTF-8'?>
<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>''')

# Select each 'Node' element, but only if:
#  - it has an 'id' attribute,
#  - its first following sibling is a text node that isn't
#    all whitespace, and
#  - its second following sibling is a 'Node' with an 'id'
for r in root.xpath('''//Node[@id]
                           [following-sibling::node()
                               [1]
                               [self::text()]
                               [normalize-space() != ""]]
                           [following-sibling::node()
                               [2]
                               [self::Node[@id]]]'''):
    # Every element matched by the XPath above has a non-blank tail
    # and a following 'Node' sibling, so the calls below are safe
    print(r.get('id'), repr(r.tail), r.getnext().get('id'))
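
If you would rather have the list of (start_node, end_node, text_between_node) tuples from the question than printed lines, the same filtering can probably be done in plain Python using each element's tail, which is where lxml stores the text that follows an element. This is only a sketch: node_spans is a made-up helper name, it assumes every Node carries an id as in the sample, and it uses str.strip() instead of normalize-space(), which treats a few more characters as whitespace.

```python
from lxml import etree

# Sample data copied from the question, wrapped in the same root element.
XML = b'''<GateDocument version="3">
<TextWithNodes><Node id="0"/>
<Node id="1"/>
<Node id="2"/>9407011<Node id="9"/>
<Node id="10"/>ACL<Node id="13"/> <Node id="14"/>1994<Node id="18"/>
<Node id="19"/> Lg.Pr.Dc <Node id="29"/>
</TextWithNodes></GateDocument>'''


def node_spans(root):
    """Hypothetical helper: collect (start_id, end_id, text) tuples."""
    spans = []
    for node in root.iter('Node'):
        nxt = node.getnext()  # next sibling element, or None
        text = node.tail      # text between this element and the next
        if nxt is None or nxt.tag != 'Node':
            continue          # no closing Node for this span
        if text is None or not text.strip():
            continue          # nothing but whitespace in between
        spans.append((node.get('id'), nxt.get('id'), text))
    return spans


spans = node_spans(etree.fromstring(XML))
for span in spans:
    print(span)
```

With the sample above this should keep the same four pairs as the XPath version (2/9, 10/13, 14/18 and 19/29). To read from a file instead, pass etree.parse(path).getroot() to the helper.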
