Conditionally removing elements from an XML document

<span class="nobr"> <a href="http://www.google.com/"> http://www.google.com/ <sup> <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/> </sup> </a> </span>

doc = self.__refactor_links(doc)
...
def __refactor_links(self, node):
    """Recursively seeks for links to refactor them"""
    for span in node.childNodes:
        replace = False
        if isinstance(span, xml.dom.minidom.Element):
            if span.tagName == "span" and span.getAttribute("class") == "nobr":
                if span.childNodes.length == 1:
                    a = span.childNodes.item(0)
                    if isinstance(a, xml.dom.minidom.Element):
                        if a.tagName == "a" and a.getAttribute("href"):
                            if a.childNodes.length == 2:
                                aurl = a.childNodes.item(0)
                                if isinstance(aurl, xml.dom.minidom.Text):
                                    sup = a.childNodes.item(1)
                                    if isinstance(sup, xml.dom.minidom.Element):
                                        if sup.tagName == "sup":
                                            if sup.childNodes.length == 1:
                                                img = sup.childNodes.item(0)
                                                if isinstance(img, xml.dom.minidom.Element):
                                                    if img.tagName == "img" and img.getAttribute("class") == "rendericon":
                                                        replace = True
            else:
                self.__refactor_links(span)
        if replace:
            a.removeChild(sup)
    return node

3 Answers

User

#1 · edited 2024-07-06 21:51:31

Here is a quick way to do it with lxml. XPath is highly recommended.

>>> from lxml import etree
>>> doc = etree.XML("""<span class="nobr">
...  <a href="http://www.google.com/">
...   http://www.google.com/
...   <sup>
...    <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
...   </sup>
...  </a>
... </span>""")
>>> for a in doc.xpath('//span[@class="nobr"]/a[@href="http://www.google.com/"]'):
...     for sub in list(a):
...         a.remove(sub)
...
>>> print(etree.tostring(doc, pretty_print=True).decode())
<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  </a>
</span>

User

#2 · edited 2024-07-06 21:51:31

I'm not very good with XML, but can't you just use find/search on the nodes?

>>> from xml.dom.minidom import parseString
>>> # x is assumed to hold the XML string from the question
>>> dom = parseString(x)
>>> k = dom.getElementsByTagName('sup')
>>> for l in k:
...     p = l.parentNode
...     p.removeChild(l)
... 
<DOM Element: sup at 0x100587d40>
>>> 
>>> print(dom.toxml())
<?xml version="1.0" ?><span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/

 </a>
</span>
>>>
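
Note that the session above removes every sup element, whatever its context. Below is a minimal sketch of how the question's conditions could be added while staying with minidom. The function names `is_render_icon_sup` and `remove_render_icon_sups` are made up for illustration, and the check is looser than the original code: it ignores whitespace text nodes instead of requiring exact child counts.

```python
from xml.dom.minidom import parseString

SAMPLE = """<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>"""


def is_render_icon_sup(sup):
    """True if sup sits in span.nobr > a[href] and contains an img.rendericon."""
    a = sup.parentNode
    if a.nodeType != a.ELEMENT_NODE or a.tagName != "a" or not a.getAttribute("href"):
        return False
    span = a.parentNode
    if (span.nodeType != span.ELEMENT_NODE or span.tagName != "span"
            or span.getAttribute("class") != "nobr"):
        return False
    # Look only at element children, so whitespace text nodes do not matter.
    return any(
        child.nodeType == child.ELEMENT_NODE
        and child.tagName == "img"
        and child.getAttribute("class") == "rendericon"
        for child in sup.childNodes
    )


def remove_render_icon_sups(xml_text):
    """Parse xml_text, drop the matching sup elements, return the serialized root."""
    dom = parseString(xml_text)
    # Take a snapshot of the matches before mutating the tree.
    for sup in list(dom.getElementsByTagName("sup")):
        if is_render_icon_sup(sup):
            sup.parentNode.removeChild(sup)
    return dom.documentElement.toxml()


print(remove_render_icon_sups(SAMPLE))
```

A sup that is not inside a span with class "nobr" and an a with an href, or whose img has a different class, is left in place.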

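User

#3 · edited 2024-07-06 21:51:31

This is definitely a job for an XPath expression, in your case probably combined with lxml.

The XPath could look something like this:

//span[@class="nobr"]/a[@href]/sup[img/@class="rendericon"]

Match the tree against this XPath expression and remove all matching elements. No need for endless if constructs or recursion.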
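
As a rough sketch of that idea, assuming lxml is installed: the helper name `strip_render_icons` and the constant `ICON_SUP` are invented for this example, and the sample input is the snippet from the question.

```python
from lxml import etree

# XPath from the answer: a sup holding an img.rendericon, inside span.nobr > a[href].
ICON_SUP = '//span[@class="nobr"]/a[@href]/sup[img/@class="rendericon"]'

SAMPLE = """<span class="nobr">
 <a href="http://www.google.com/">
  http://www.google.com/
  <sup>
   <img align="absmiddle" alt="" border="0" class="rendericon" height="7" src="http://jira.atlassian.com/icon.gif" width="7"/>
  </sup>
 </a>
</span>"""


def strip_render_icons(xml_text):
    """Remove every element matched by ICON_SUP and return the serialized tree."""
    root = etree.fromstring(xml_text)
    for sup in root.xpath(ICON_SUP):
        # lxml's remove() drops the element together with its tail text;
        # in the sample the tail is only whitespace.
        sup.getparent().remove(sup)
    return etree.tostring(root, encoding="unicode")


print(strip_render_icons(SAMPLE))
```

Because all the conditions live in the expression itself, changing what gets removed only means editing `ICON_SUP`. If the text following a matched sup must be preserved, it would have to be copied out of `sup.tail` before the call to `remove()`.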
