Scrapy callback functions: how do I parse several pages? (Answers)

Category: Python Q&A


1 answer

Anonymous, 1 day ago

In <code>parse_link1</code>, you are passing a list as the value of <code>url</code>, the first argument of the <code>Request</code> constructor, where a single value is expected. The list is the result of calling <code>.extract()</code> on a <code>SelectorList</code>, which is what <code>.xpath()</code> returns when called on the <code>hxs</code> selector.

Use <code>.extract_first()</code> instead:

<pre><code>return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract_first())
</code></pre>

<hr/>

Edit after the OP's comment:

The problem reported there is due to the XPath expression being "too conservative". It was probably given by your browser's Inspect tool (I tested the XPath in Chrome and it works for <a href="http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b" rel="nofollow">this example page</a>).

The issue is with <code>.../table/tbody/tr/...</code>. <code>&lt;tbody&gt;</code> is rarely present in real HTML pages written by people, or even in templates (which are also written by people). HTML wants a <code>&lt;table&gt;</code> to have a <code>&lt;tbody&gt;</code>, but nobody really cares and browsers cope with it fine: they inject the missing <code>&lt;tbody&gt;</code> element to host the <code>&lt;tr&gt;</code> rows. The HTML that Scrapy downloads has had no such element injected, so an XPath that requires <code>tbody</code> matches nothing.

So, although it is not a strictly equivalent XPath, you can usually:

<ul>
<li>omit <code>tbody/</code> and use the <code>table/tr</code> pattern</li>
<li>or use <code>table//tr</code></li>
</ul>

See it in action with <code>scrapy shell</code>:

<pre><code>$ scrapy shell http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
>>>
>>> # with XPath from browser tool (I assume), you get nothing for the "real" downloaded HTML
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a')
[]
>>>
>>> # or, omitting `tbody/`
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a')
[&lt;Selector xpath='//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a' data=u'&lt;a href="/befattningshavare/de_Sauvage-N'&gt;]
>>> # replacing "/table/tbody/" with "/table//" (tbody is added by browser to have "correct DOM tree")
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a')
[&lt;Selector xpath='//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a' data=u'&lt;a href="/befattningshavare/de_Sauvage-N'&gt;]
>>>
>>> # suggestion: use the &lt;img&gt; child of the &lt;a&gt; as a predicate
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]')
[&lt;Selector xpath='//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]' data=u'&lt;a href="/befattningshavare/de_Sauvage-N'&gt;]
>>>
</code></pre>

Also, you will need to:

<ul>
<li>get the "href" attribute value (add <code>@href</code> at the end of the XPath)</li>
<li>build an absolute URL; <code>response.urljoin()</code> is a handy shortcut for this</li>
</ul>

Continuing in the Scrapy shell:

<pre><code>>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a/@href').extract_first()
u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> response.urljoin(u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
u'http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>>
</code></pre>

In the end, your callback could become:

<pre><code># assumes `from scrapy import Request` at the top of the module
def parse_link1(self, response):
    # .extract() returns a list here, after .xpath()
    # so you can loop, even if you have 1 result
    #
    # XPaths can be multiline, it's easier to read for long expressions
    for href in response.xpath('''
        //*[@id="printContent"]
            /div[2]
                /table//tr[4]/td
                    /table//tr/td[2]/a/@href''').extract():
        yield Request(response.urljoin(href),
                      callback=self.parse_link2)
</code></pre>
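The <code>tbody</code> effect and the URL joining step can also be reproduced without Scrapy. The following is only a minimal sketch using the Python standard library: <code>xml.etree.ElementTree</code> stands in for Scrapy's selectors (it supports just a subset of XPath and needs well-formed markup), and <code>urllib.parse.urljoin</code> stands in for <code>response.urljoin()</code>. The markup in <code>RAW_HTML</code>, the <code>PAGE_URL</code> value, and the link path are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical markup, as a server might send it: the <table> has no <tbody>.
RAW_HTML = """
<div id="printContent">
  <table>
    <tr>
      <td>Name</td>
      <td><a href="/befattningshavare/example/abc123">profile</a></td>
    </tr>
  </table>
</div>
"""

# Hypothetical page URL, used only to resolve the relative link.
PAGE_URL = "http://www.allabolag.se/some/page"

root = ET.fromstring(RAW_HTML)  # root is the <div id="printContent">

# Path as a browser inspector would give it, including the injected <tbody>.
with_tbody = root.findall("./table/tbody/tr/td/a")

# Same path with "tbody/" omitted.
without_tbody = root.findall("./table/tr/td/a")

# "//" matches <tr> at any depth below <table>, with or without <tbody>.
any_depth = root.findall("./table//tr/td/a")

# Read the href attribute and turn it into an absolute URL.
hrefs = [a.get("href") for a in any_depth]
absolute_urls = [urljoin(PAGE_URL, href) for href in hrefs]

print(len(with_tbody), len(without_tbody), len(any_depth))
print(absolute_urls)
```

The path that includes <code>tbody</code> finds nothing in the raw markup, while the other two each find the link. In a real spider the same checks are better done in <code>scrapy shell</code> against the downloaded response, as shown in the answer.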
