Parsing HTML with lxml and XPath

Published 2024-05-21 16:57:23


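I'm trying to use lxml with Python, because after some searching and reading, lxml was recommended over other parsing packages. I have the DOM structure shown below. I managed to write an XPath for it and double-checked it in XPath Checker to confirm it is valid. The XPath works fine in XPath Checker, but when I use it with lxml in Python I don't get the expected result: I get objects instead of the actual text.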

Here is my DOM structure:

<div class="pdsc-l">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td width="35%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
</td>
<td width="65%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
</td>
</tr>
<tr>
<td width="35%" valign="top">
<td width="65%" valign="top">

The XPath I wrote gives me what I want:

//td//font[text()='Brand']/following::td[1]

But with lxml I don't get that result.

This is my code:
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        print tr.xpath("//td//font[text()='Brand']/following::td[1]")

This is the output:

[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
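
A note on why the same element is printed once per row: an XPath that starts with `//` is evaluated from the document root, even when it is called on a `tr` element, so every iteration runs the same document-wide query. Prefixing the expression with `.` makes it relative to the current row. The sketch below illustrates the difference on a small hypothetical page (`SAMPLE` is made up for illustration, and the code uses the Python 3 `print` function rather than the Python 2 statement above).

```python
from lxml import etree

# Hypothetical sample page: two rows, only the first holds the "Brand" label.
SAMPLE = """
<table>
  <tr><td><font>Brand</font></td><td><font>HTC</font></td></tr>
  <tr><td><font>Model</font></td><td><font>One</font></td></tr>
</table>
"""

tree = etree.HTML(SAMPLE)
rows = tree.xpath("//tr")

# "//..." is absolute: each row searches the whole document again.
absolute = [
    len(tr.xpath("//td//font[text()='Brand']/following::td[1]")) for tr in rows
]

# ".//..." is relative: each row only looks for the label inside itself.
relative = [
    len(tr.xpath(".//td//font[text()='Brand']/following::td[1]")) for tr in rows
]

print(absolute)  # number of matches per row with the absolute path
print(relative)  # number of matches per row with the relative path
```

Since the label occurs only once in a page like this, the loop over rows is probably unnecessary: calling `tree.xpath(...)` a single time with the original expression returns the same cell.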

I tried the following change but still don't get the result. The code below includes the URL, which I hope will help in getting a better answer:

    import urllib2
    from lxml import etree
    from lxml.html import fromstring, tostring
    url = 'http://www.ebay.com/ctg/111176858'
    request = urllib2.Request(url)
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        t = tr.xpath("//td//font[text()='Brand']/following::td[1]")[0]
        print tostring(t)

1 answer
User
#1 · Posted 2024-05-21 16:57:23

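Appending [0].text to the expression in the print statement from your question should give you what you need. What your question prints is a list containing a single lxml.etree._Element. Elements have attributes such as tag and text that you can use to read their properties. So try: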

tr.xpath("//td//font[text()='Brand']/following::td[1]")[0].text
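
One caveat, assuming the target `td` wraps its value in a `font` element as in the DOM from the question: `.text` holds only the text that appears before the first child element, so on that `td` it may be just whitespace rather than `HTC`. A minimal sketch of one way around this is to ask XPath for the string value of the cell with `string(.)`. `SAMPLE` is a hypothetical fragment modelled on the question's markup, and the code is written for Python 3.

```python
from lxml import etree

# Hypothetical fragment modelled on the DOM in the question.
SAMPLE = """
<table>
  <tr>
    <td width="35%" valign="top">
      <font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
    </td>
    <td width="65%" valign="top">
      <font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
    </td>
  </tr>
</table>
"""

tree = etree.HTML(SAMPLE)

# One document-wide query; no loop over rows is needed.
cells = tree.xpath("//td//font[text()='Brand']/following::td[1]")
cell = cells[0]

# .text covers only the text before the first child element of the td,
# which here is at most whitespace (or None), not the brand name.
direct_text = (cell.text or "").strip()

# string(.) concatenates the text of the cell and all of its descendants.
brand = cell.xpath("string(.)").strip()

print(repr(direct_text))
print(brand)
```

If the value is always inside the `font` element, extending the original expression with `/font/text()` is another option. It returns a list of strings instead of elements.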
