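<p>The problem is that the <em>HTML on this page is far from well-formed</em>. To demonstrate, note how the same CSS selector yields 0 results with Scrapy but 94 with <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow"><code>BeautifulSoup</code></a>:</p>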
<pre><code>In [1]: from bs4 import BeautifulSoup
In [2]: soup = BeautifulSoup(response.body, 'html5lib') # note: "html5lib" has to be installed
In [3]: len(soup.select(".article h4 a"))
Out[3]: 94
In [4]: len(response.css(".article h4 a"))
Out[4]: 0
</code></pre>
<p>The same applies to the <code>pubBody</code> elements you are looking for.</p>
<p>So, try hooking up <code>BeautifulSoup</code> to fix and clean up the HTML, ideally via a <a href="http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#writing-your-own-downloader-middleware" rel="nofollow">middleware</a>.</p>
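<p>A minimal sketch of what such a downloader middleware could look like. The class name <code>SoupCleanupMiddleware</code> is hypothetical, and this is an illustration under those assumptions rather than the code of any published package. It re-serializes every text response through <code>BeautifulSoup</code>, so that Scrapy's selectors see repaired markup, and reads the parser name from a <code>BEAUTIFULSOUP_PARSER</code> setting:</p>
<pre><code>from bs4 import BeautifulSoup
from scrapy.http import TextResponse


class SoupCleanupMiddleware:
    """Hypothetical middleware: pipe response bodies through BeautifulSoup."""

    def __init__(self, parser="html5lib"):
        # "html5lib" must be installed separately; "html.parser" is built in
        self.parser = parser

    @classmethod
    def from_crawler(cls, crawler):
        # Let the project pick the parser in settings.py
        return cls(crawler.settings.get("BEAUTIFULSOUP_PARSER", "html5lib"))

    def process_response(self, request, response, spider):
        # Leave binary (non-text) responses untouched
        if not isinstance(response, TextResponse):
            return response
        # Parse the possibly malformed markup and emit a repaired document
        soup = BeautifulSoup(response.body, self.parser)
        return response.replace(body=str(soup))
</code></pre>
<p>Returning a new response from <code>process_response</code> means the spider callbacks only ever see the cleaned HTML, so existing <code>response.css(...)</code> calls do not need to change.</p>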
<hr/>
<p>I've created a simple <a href="https://github.com/alecxe/scrapy-beautifulsoup" rel="nofollow"><code>scrapy-beautifulsoup</code> middleware</a> that plugs easily into a project:</p>
<ul>
<li><p>Install via pip:</p>
<pre><code>pip install scrapy-beautifulsoup
</code></pre></li>
<li><p>Configure the middleware in <code>settings.py</code>:</p>
<pre><code>DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 543,
}
BEAUTIFULSOUP_PARSER = "html5lib"
</code></pre></li>
</ul>
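<p>With the middleware enabled, the spider itself can keep using Scrapy's own selectors. A rough sketch follows; the spider name, start URL and item fields are placeholders, and only the <code>.article h4 a</code> selector comes from the session above:</p>
<pre><code>import scrapy


class ArticleLinksSpider(scrapy.Spider):
    name = "article_links"                 # hypothetical spider name
    start_urls = ["https://example.com/"]  # placeholder URL

    def parse(self, response):
        # The response body has already been repaired by the middleware
        for link in response.css(".article h4 a"):
            yield {
                "title": link.css("::text").get(),
                "url": response.urljoin(link.attrib.get("href", "")),
            }
</code></pre>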
<p>Profit.</p>