Parsing HTML data with lxml

<tr>
  <td><a href="website1.com">website1</a></td>
  <td>info1</td>
  <td>info2</td>
  <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
  <td><a href="website2.com">website2</a></td>
  <td>info1</td>
  <td>info2</td>
  <td><a href="spam2.com">spam2</a></td>
</tr>

3 answers

User

#1 · edited 2024-09-27 02:20:30

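import lxml.html as LH

# `content` is the HTML string that contains the <tr> rows from the question
doc = LH.fromstring(content)
# For each row: the href of the link in the first cell, then the text of cells 2 and 3
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])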

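The long XPath has the following meaning:

td[1]                               select the first td
  /a                                select its a child
    /@href                          return the href attribute value
|                                   union with
td[position()=2 or position()=3]    select the second and third td
  /text()                           return their text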
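To try the XPath from this answer without any external input, here is a minimal self-contained sketch. The `sample_html` string and the `<table>` wrapper around the rows are assumptions added for illustration; the answer itself reads the markup from a variable named `content`.

```python
import lxml.html as LH

# Assumed sample input: the two rows from the question, wrapped in a <table>
# (the wrapper is an addition for this sketch, not part of the question).
sample_html = """
<table>
  <tr> <td><a href="website1.com">website1</a></td> <td>info1</td> <td>info2</td> <td><a href="spam1.com">spam1</a></td> </tr>
  <tr> <td><a href="website2.com">website2</a></td> <td>info1</td> <td>info2</td> <td><a href="spam2.com">spam2</a></td> </tr>
</table>
"""

doc = LH.fromstring(sample_html)

# Union of two relative paths, evaluated once per row:
#   td[1]/a/@href                             -> link target in the first cell
#   td[position()=2 or position()=3]/text()   -> text of the second and third cells
xpath = 'td[1]/a/@href | td[position()=2 or position()=3]/text()'

# str() turns lxml's string results into plain Python strings.
rows = [[str(value) for value in tr.xpath(xpath)] for tr in doc.xpath('//tr')]
print(rows)
```

Because the path is relative and evaluated per `tr`, the fourth cell (the spam link) is never selected: it matches neither `td[1]` nor positions 2 and 3.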

User

#2 · edited 2024-09-27 02:20:30

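import lxml.html as lh

# `your_html` is the HTML string that contains the <tr> rows
tree = lh.fromstring(your_html)

result = []
# "tr" matches only direct children of the parsed root;
# use ".//tr" if the rows are nested inside a <table>
for row in tree.xpath("tr"):
    # Keep the first three cells and ignore the trailing spam cell
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])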

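Result (for the two sample rows in the question):

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]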
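The loop in this answer assumes every row has at least three cells and a link in the first one. As a hedged sketch of how the same idea could tolerate irregular rows, the hypothetical helper `extract_rows` below skips rows that do not fit that shape. The function name, the sample markup, and the skip-on-mismatch policy are all assumptions, not part of the original answer.

```python
import lxml.html as lh


def extract_rows(html):
    """Return [href, info1, info2] for each row that has the expected shape.

    Hypothetical helper: rows with fewer than three cells, or without a
    link in the first cell, are skipped instead of raising an error.
    """
    tree = lh.fromstring(html)
    result = []
    for row in tree.iter("tr"):          # every <tr>, at any depth
        cells = row.findall("td")        # direct <td> children only
        if len(cells) < 3:
            continue
        link = cells[0].find("a")
        if link is None or link.get("href") is None:
            continue
        result.append([link.get("href"),
                       cells[1].text_content().strip(),
                       cells[2].text_content().strip()])
    return result


# Assumed sample input: the question's rows plus one malformed row.
sample_html = """<table>
  <tr><td><a href="website1.com">website1</a></td><td>info1</td><td>info2</td><td><a href="spam1.com">spam1</a></td></tr>
  <tr><td>no link here</td><td>info1</td></tr>
  <tr><td><a href="website2.com">website2</a></td><td> info1 </td><td>info2</td><td><a href="spam2.com">spam2</a></td></tr>
</table>"""

rows = extract_rows(sample_html)
print(rows)
```

Whether to skip or to report malformed rows depends on the data; skipping silently is only one possible choice.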

User

#3 · edited 2024-09-27 02:20:30

I use this XPath: td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()

$ python3
>>> import lxml.html
>>> doc = lxml.html.parse('data.xml')
>>> [[j for j in i.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')] for i in doc.xpath('//tr')]
[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
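This XPath selects by content rather than by column position: it keeps the `href` of any link whose text does not contain "spam", plus the text of any cell that has no link. A small sketch that parses from a string instead of `data.xml` illustrates this. The `sample_html` markup is an assumption, and its second row deliberately puts the spam link first.

```python
import lxml.html

# Assumed sample input; in the second row the spam link is the FIRST cell,
# to show that the filter does not depend on column order.
sample_html = """<table>
  <tr><td><a href="website1.com">website1</a></td><td>info1</td><td>info2</td><td><a href="spam1.com">spam1</a></td></tr>
  <tr><td><a href="spam2.com">spam2</a></td><td><a href="website2.com">website2</a></td><td>info1</td><td>info2</td></tr>
</table>"""

doc = lxml.html.fromstring(sample_html)

# td/a[not(contains(., "spam"))]/@href -> links whose text lacks "spam"
# td[not(a)]/text()                    -> text of cells without a link
xpath = 'td/a[not(contains(., "spam"))]/@href | td[not(a)]/text()'

rows = [[str(value) for value in tr.xpath(xpath)] for tr in doc.xpath('//tr')]
print(rows)
```

One caveat of filtering on link text: a legitimate link whose text happens to contain "spam" would be dropped as well.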
