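<p>In <code>parse_link1</code>, you are passing a list as the value of <code>url</code>, the first argument of the <code>Request</code> constructor, where a single value is expected. That list is the result of <code>.extract()</code> on a <code>SelectorList</code> (itself the result of calling <code>.xpath()</code> on the <code>hxs</code> selector).</p>
<p>Use <code>.extract_first()</code> instead:</p>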
<pre><code>return Request(hxs.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a').extract_first()
</code></pre>
<hr/>
<p>Edit after OP's comment</p>
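<p>This is due to the XPath expression being "too conservative". It was probably given to you by your browser's Inspect tool (I tested the XPath in Chrome and it works for <a href="http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b" rel="nofollow">this example page</a>), but it matches nothing in the HTML that Scrapy actually downloads.</p>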
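<p>The problem is the <code>.../table/tbody/tr/...</code> part. <code><tbody></code> rarely appears in real HTML pages written by people, or even in templates (which are also written by people).
HTML expects a <code><table></code> to have a <code><tbody></code>, but nobody really cares, and browsers cope fine: they inject the missing <code><tbody></code> elements to hold the <code><tr></code> rows.</p>
<p>So, although the resulting XPath is not strictly equivalent, you can usually:</p>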
<ul>
<li>omit <code>tbody/</code> and use a <code>table/tr</code> pattern</li>
<li>or use <code>table//tr</code></li>
</ul>
<p>Use <code>scrapy shell</code> to see this in action:</p>
<pre><code>$ scrapy shell http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan/f6da68933af6383498691f19de7ebd4b
>>>
>>> # with XPath from browser tool (I assume), you get nothing for the "real" downloaded HTML
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tbody/tr[4]/td/table/tbody/tr/td[2]/a')
[]
>>>
>>> # or, omitting `tbody/`
>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>> # replacing "/table/tbody/" with "/table//" (tbody is added by browser to have "correct DOM tree")
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr[4]/td/table//tr/td[2]/a' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>
>>> # suggestion: use the <img> tag after the <a> as predicate
>>> response.xpath('//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]')
[<Selector xpath='//*[@id="printContent"]/div[2]/table//tr/td/table//tr/td/a[img/@alt="personprofil"]' data=u'<a href="/befattningshavare/de_Sauvage-N'>]
>>>
</code></pre>
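<p>The difference between the served HTML and the browser's DOM can also be reproduced offline. The following is only a minimal sketch: it uses the standard library's <code>xml.etree.ElementTree</code> instead of Scrapy selectors, and the markup, the <code>hrefs</code> helper and the <code>/first</code> and <code>/second</code> links are hypothetical stand-ins, not the real page.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the HTML as served: no <tbody>.
RAW_HTML = """
<div id="printContent">
  <table>
    <tr><td><a href="/first">first</a></td></tr>
    <tr><td><a href="/second">second</a></td></tr>
  </table>
</div>
"""

# Roughly what a browser builds from it: <tbody> injected around the rows.
BROWSER_DOM = (RAW_HTML
               .replace("<table>", "<table><tbody>")
               .replace("</table>", "</tbody></table>"))


def hrefs(markup, path):
    """Return the href of every <a> matched by `path` below the root <div>."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(path)]


for label, markup in (("served HTML", RAW_HTML), ("browser DOM", BROWSER_DOM)):
    for path in ("./table/tbody/tr/td/a",   # path copied from the browser
                 "./table/tr/td/a",         # tbody/ omitted
                 "./table//tr/td/a"):       # table//tr
        print(label, path, hrefs(markup, path))
```

<p>In this sketch only the <code>table//tr</code> form finds the links in both variants, which is why it tends to be the more robust choice when you cannot be sure whether the served page contains <code><tbody></code>.</p>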
<p>In addition, you also need to:</p>
<ul>
<li>get the "href" attribute value (add <code>@href</code> at the end of your XPath)</li>
<li>build an absolute URL; <code>response.urljoin()</code> is a handy shortcut for that</li>
</ul>
<p>Continuing in the Scrapy shell:</p>
<pre><code>>>> response.xpath('//*[@id="printContent"]/div[2]/table/tr[4]/td/table/tr/td[2]/a/@href').extract_first()
u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>> response.urljoin(u'/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b')
u'http://www.allabolag.se/befattningshavare/de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/f6da68933af6383498691f19de7ebd4b'
>>>
</code></pre>
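<p>Outside of a Scrapy shell, the same resolution can be sketched with the standard library. As far as I understand, <code>response.urljoin()</code> is a thin wrapper around <code>urljoin</code> that uses the response's URL as the base; a <code><base></code> tag in the page could change that base, and this sketch ignores that case. The variable names are mine.</p>

```python
from urllib.parse import urljoin

# URL of the downloaded page (the example page used in this answer).
page_url = ("http://www.allabolag.se/befattningshavare/"
            "de_Sauvage-Nolting%252C_Henri_Jacob_Jan/"
            "f6da68933af6383498691f19de7ebd4b")

# Relative link, as returned by the XPath ending in /a/@href.
href = ("/befattningshavare/"
        "de_Sauvage-Nolting%252C_Henri_Jacob_Jan_Personprofil/"
        "f6da68933af6383498691f19de7ebd4b")

# A leading "/" replaces the whole path of the base URL;
# scheme and host are kept.
absolute_url = urljoin(page_url, href)
print(absolute_url)
```

<p>Joining an already absolute URL leaves it unchanged, so calling it on every extracted link is safe.</p>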
<p>Finally, your callback could become:</p>
<pre><code>def parse_link1(self, response):
# .extract() returns a list here, after .xpath()
# so you can loop, even if you have 1 result
#
# XPaths can be multiline, it's easier to read for long expressions
for href in response.xpath('''
//*[@id="printContent"]
/div[2]
/table//tr[4]/td
/table//tr/td[2]/a/@href''').extract():
yield Request(response.urljoin(href),
callback=self.parse_link2)
</code></pre>