Reading data from XML in Python

Posted 2024-10-02 20:37:35


I am using Scrapy in Python.

I am trying to read XPath expressions from an XML file, like this:

def getMasterContainers(self):
    containers=[]
    containersFromXML = self.doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        containers.append(oneXpath.text)
    return containers

The XML file stores each XPath expression in an <xpath> element under MasterPage/Containers.

When I print the result in cmd, I get:

container = ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

My question

When I call sel.xpath(self.containers[0]), I get no results, but when I write the xpath directly in the code, as in sel.xpath('xpath written by hand'), I get the expected data.

Please help.


1 answer
User
#1 · Posted 2024-10-02 20:37:35

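Update: Are you sure the problem is this xpath? Have you confirmed that nothing fails before or after the xpath call? I am not sure how to run the scrape with Scrapy, so I only ran the XML parsing by hand. The following worked for me on both the actual document and a test document.

first.xml contains only the xpath and its parent structure: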

<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>

Next, parse first.xml and print the text of each <xpath> element.
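This parsing step can be sketched as follows, assuming the standard-library xml.etree.ElementTree and the same findall path as in the question. The helper name load_containers and the inline FIRST_XML string are hypothetical, used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Inline copy of first.xml so the sketch runs without a file on disk.
FIRST_XML = """<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>"""


def load_containers(root):
    """Return the text of every MasterPage/Containers/xpath element under root."""
    return [node.text for node in root.findall('MasterPage/Containers/xpath')]


# ET.fromstring returns the root element; for a file on disk,
# ET.parse('first.xml').getroot() returns the same kind of object.
containers = load_containers(ET.fromstring(FIRST_XML))
for container in containers:
    print(container)
```

The path is relative to the root element, so websiteInformation itself does not appear in it.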

Output:

.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']

Looks fine.

test.html is:

<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>

Search it with:

from scrapy.selector import Selector

# containers is the list of xpath strings parsed from first.xml
with open('test.html') as f:
    sel = Selector(text=f.read())

for container in containers:
    print("Xpath: {}".format(container))
    result = sel.xpath(container)  # list of matching selectors
    print("Container: {}".format(len(result)))
    for elem in result:
        print(elem)

Output:

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 2
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>

Result of the same search on the page at the actual URL, downloaded with wget:

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 25
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>
# omitted 23
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>

Your xpath string seems to have extra single quotes (') that should not be there. In the XML, it looks like:

<xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>

which, when parsed, becomes (as your printout shows):

''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

You do not want the surrounding single quotes. It should look like this:

.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]

If you can edit the XML file that contains the xpaths, remove the leading '&apos; and the trailing &apos;' from each <xpath>. So:

<Containers>
  <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

should become:

<Containers>
  <xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
</Containers>

However, if for some reason you cannot edit the XML file, strip the surrounding single quotes after reading the xpath text. So:

containers.append(oneXpath.text)

should become:

containers.append(oneXpath.text.strip("'"))
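The effect of strip("'") on the doubled quotes can be checked in isolation. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree; the QUOTED_XML sample string and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample element with a literal ' plus an &apos; entity on each side,
# as in the quoted XML above.
QUOTED_XML = """<Containers>
  <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>"""

root = ET.fromstring(QUOTED_XML)

# The parser decodes &apos; to ', so the text starts and ends with two quotes.
raw = root.find('xpath').text

# strip("'") removes every leading and trailing single quote, not just one.
cleaned = raw.strip("'")

print(raw)
print(cleaned)
```

Because strip only touches the ends of the string, quotes inside the expression are left alone. If the element text might also carry surrounding whitespace, calling .strip() before .strip("'") would handle that case too.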
