Searching for a string in elements and attributes

html = """
<html>
<head></head>
<body>
1 <a href="/test1" id="download">test 1</a>
2 <a href="/test2" class="download">test 2</a>
3 <a href="/download">test 3</a>
4 <a href="/test4">DoWnLoAd</a>
5 <a href="/test5">ascascDoWnLoAdsacsa</a>
6 <a href="/test6"><div id="test6">download</div></a>
7 <a href="/test7"><div id="download">test7</div></a>
</body>
</html>
"""

from lxml import etree

tree = etree.fromstring(html, etree.HTMLParser())
# Match <a> elements whose own id, class, href, or direct text contains
# "download"; translate() lowercases the letters of DOWNLOAD for a
# case-insensitive comparison.
downloadElementConditions = "//a[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)
print('FOUND ELEMENTS:', len(elements))
for i in elements:
    print(i.get('href'), i.text)

from lxml import etree

tree = etree.fromstring(html, etree.HTMLParser())
# Same condition, but on any element (*) instead of only <a>.
downloadElementConditions = "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
elements = tree.xpath(downloadElementConditions)
print('FOUND ELEMENTS:', len(elements))
for el in elements:
    href = el.get('href')
    if href:
        print(href, el.text)
    else:
        # No href on the matched element: walk up at most 10 ancestors to find one.
        elparent = el
        for _ in range(10):
            elparent = elparent.getparent()
            if elparent is None:  # reached the root without finding an href
                break
            href = elparent.get('href')
            if href:
                print(href, elparent.text)
                break

2 Answers

User

#1 · Edited 2024-10-03 13:23:35

Changing the XPath selection from strictly matching a tags to a wildcard should do it: "//*[(@id|@class|@href|text())[contains(translate(.,'DOWNLOAD','download'), 'download')]]"
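
Note that the wildcard returns whichever element matched, which may be a div nested inside the link rather than the link itself. As a rough sketch (the shortened sample HTML and the variable names match_any, matched and links are only illustrative, not part of the answer), one possible way to get back to the enclosing a is to append an ancestor-or-self::a step:

```python
from lxml import etree

# Shortened, illustrative sample; not the full HTML from the question.
html = """
<html><body>
<a href="/test1" id="download">test 1</a>
<a href="/test6"><div id="test6">download</div></a>
<a href="/test7"><div id="download">test7</div></a>
<a href="/other">other</a>
</body></html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# Wildcard expression from the answer: matches any element, not only <a>.
match_any = ("//*[(@id|@class|@href|text())"
             "[contains(translate(.,'DOWNLOAD','download'), 'download')]]")

matched = tree.xpath(match_any)
print([el.tag for el in matched])  # elements that matched directly

# Assumption: we want the enclosing link, so step from each match
# to itself or its ancestors and keep only <a> elements.
links = tree.xpath(match_any + "/ancestor-or-self::a")
print([a.get("href") for a in links])
```

This would replace the manual getparent() loop from the question, assuming every match of interest sits inside an a element.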

User

#2 · Edited 2024-10-03 13:23:35

Pure XPath solution

Change text() to . and search for attributes on the descendant-or-self axis:

//a[(.|.//@id|.//@class|.//@href)[contains(translate(.,'DOWNLOAD','download'),'download')]]
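
For reference, here is a minimal sketch of running that expression with lxml. The sample HTML is adapted from the question, and the extra /test8 link is an assumption added only to show a non-matching case:

```python
from lxml import etree

# Sample adapted from the question; /test8 is an added non-matching link.
html = """
<html><body>
<a href="/test1" id="download">test 1</a>
<a href="/test2" class="download">test 2</a>
<a href="/download">test 3</a>
<a href="/test4">DoWnLoAd</a>
<a href="/test5">ascascDoWnLoAdsacsa</a>
<a href="/test6"><div id="test6">download</div></a>
<a href="/test7"><div id="download">test7</div></a>
<a href="/test8">unrelated</a>
</body></html>
"""

tree = etree.fromstring(html, etree.HTMLParser())

# "." is the string value of <a>; ".//@id" etc. collect attributes of
# <a> and of all its descendants. translate() makes the test
# case-insensitive for the letters of "download".
xpath = ("//a[(.|.//@id|.//@class|.//@href)"
         "[contains(translate(.,'DOWNLOAD','download'),'download')]]")

hrefs = [a.get("href") for a in tree.xpath(xpath)]
print(hrefs)
```

With this sample, all seven links from the question should be selected and /test8 should be left out.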

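Explanation:

text() vs .: Here text() matches only the immediate text node children of a, while . matches the string value of the a element. Match against the string value of a in order to catch cases where a child element of a contains the target text.
descendant-or-self: To match attributes of a and of any of its descendants, use the descendant-or-self axis (.//).

For details on string values in XPath, see "Matching text nodes is different than matching string values".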
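
The two points above can be checked on a single link. This is only a small sketch: the fragment is parsed as XML for brevity, and the variable names are illustrative.

```python
from lxml import etree

# One link from the question, parsed as a standalone XML fragment.
a = etree.fromstring('<a href="/test6"><div id="test6">download</div></a>')

# text() only sees text nodes that are direct children of <a>.
direct_text = a.xpath("text()")
# string(.) is the string value: the text of all descendants, concatenated.
string_value = a.xpath("string(.)")
print(direct_text, repr(string_value))

# The same difference inside a contains() test.
via_text = a.xpath("boolean(text()[contains(., 'download')])")
via_string_value = a.xpath("contains(., 'download')")
print(via_text, via_string_value)

# @id looks only at <a> itself; .//@id also covers its descendants.
own_ids = a.xpath("@id")
all_ids = a.xpath(".//@id")
print(own_ids, all_ids)
```

Here the a element has no direct text and no id of its own, so only the string value and the .// form find anything.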
