lxml: finding tags by regex

2 answers

User

#1 · edited 2024-10-02 04:35:20

One idea:

import lxml.etree

doc = lxml.etree.parse('test.xml')
elements = [x for x in doc.xpath('//*') if x.tag.startswith('TEXT')]
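
If a plain prefix test is not enough, the same list-comprehension idea extends to Python's own re module. The following is a minimal sketch, not part of the original answer: the inline sample document and the pattern (TEXT followed by digits) are assumptions chosen for illustration. etree.QName is used to drop any namespace from the tag before matching.

```python
import re

from lxml import etree

# Hypothetical sample document, used only for this illustration.
root = etree.XML(
    '<root><TEXT1>one</TEXT1><TEXT22>two</TEXT22><x-TEXT4>four</x-TEXT4></root>'
)

# Assumed pattern: the whole local name must be "TEXT" followed by digits.
pattern = re.compile(r'TEXT\d+')

matches = [
    el for el in root.iter()
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(el.tag, str)
    and pattern.fullmatch(etree.QName(el).localname)
]

for el in matches:
    print(el.tag, el.text)
```

Filtering in Python keeps the XPath trivial, at the cost of visiting every element in the tree.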

User

#2 · edited 2024-10-02 04:35:20

Yes, lxml's XPath supports regular expressions through the EXSLT regular-expressions namespace.

Here is an example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})
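
When the same regex query is run repeatedly, it can be precompiled with etree.XPath. The sketch below is an illustration rather than part of the original answer: the sample document and pattern are assumptions, and it relies on lxml's EXSLT re:test accepting an optional third flags argument, where 'i' requests case-insensitive matching.

```python
from lxml import etree

RE_NS = "http://exslt.org/regular-expressions"

# Compile once, reuse on any tree. The 'i' flag makes the match case-insensitive.
find_text = etree.XPath(
    "//*[re:test(local-name(), '^text[0-9]+$', 'i')]",
    namespaces={'re': RE_NS},
)

# Hypothetical sample document, used only for this illustration.
root = etree.XML(
    '<root><TEXT1>one</TEXT1><text2>two</text2><x-TEXT4>four</x-TEXT4></root>'
)

tags = [el.tag for el in find_text(root)]
print(tags)
```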

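Of course, the example you mention does not actually need a regular expression. You can use the starts-with() XPath function:

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

Complete program: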

from lxml import etree

root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

# Both queries select the same elements: TEXT1, TEXT2 and TEXT3.
assert result1 == result2

for result in result1:
    print(result.text, result.tag)
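
The queries above test local-name() rather than name(). A short sketch of why that matters, assuming a hypothetical document in which one element carries a namespace prefix (the namespace URI is a placeholder):

```python
from lxml import etree

# Hypothetical document: one TEXT element is namespaced, one is not.
root = etree.XML(
    '<root xmlns:n="http://example.com/ns">'
    '<n:TEXT1>one</n:TEXT1>'
    '<TEXT2>two</TEXT2>'
    '</root>'
)

# local-name() ignores the prefix, so both elements match.
by_local = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

# name() includes the prefix ("n:TEXT1"), so only the plain element matches.
by_name = root.xpath("//*[starts-with(name(), 'TEXT')]")

print([e.text for e in by_local])
print([e.text for e in by_name])
```

The same caveat applies to checking x.tag.startswith('TEXT') in Python, because lxml stores a namespaced tag as '{uri}localname'.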

To address the new requirement, consider the following XML:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to found one</TEXT1>
      <TEXT2>do not want to found two</TEXT2>
      <TEXT3>do not want to found three</TEXT3>
   </other_tag>
</root>

If you want to find all TEXT elements that are direct children of a <tag> element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if you want all TEXT elements that are direct children of the first tag element:

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if you only want the first TEXT element of each tag element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')
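
One caveat about the //tag[1] form used above: in XPath 1.0 it selects every tag element that is the first tag child of its own parent, not the first tag in the whole document. With the sample XML the two readings coincide, but they can differ. A minimal sketch, assuming a hypothetical document with tag elements under two different parents:

```python
from lxml import etree

# Hypothetical document: each of <a> and <b> has its own <tag> child.
root = etree.XML(
    '<root>'
    '<a><tag><TEXT1>one</TEXT1></tag></a>'
    '<b><tag><TEXT1>two</TEXT1></tag></b>'
    '</root>'
)

# First tag child of each parent: matches under both <a> and <b>.
per_parent = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")

# Parenthesised form: the first tag in document order only.
first_only = root.xpath("(//tag)[1]/*[starts-with(local-name(), 'TEXT')]")

print([e.text for e in per_parent])
print([e.text for e in first_only])
```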
