<p>I just noticed that you posted here the same question you had already asked in <a href="https://stackoverflow.com/questions/36083221/how-to-use-scrapy-to-crawl-data-from-multipages-which-are-implemented-by-javascr/36110175#36110175">this post</a>, which I answered yesterday. So I am posting my answer here again and will let the moderators decide...</p>
<p>Your example works for me when the link extraction and request generation are added to the parse() function. Perhaps the page sets some server-side cookies. It fails, however, when using a proxy service like <a href="http://crawlera.com/" rel="nofollow">Scrapy's Crawlera</a>, which downloads from multiple IP addresses.</p>
<p>The solution is to insert the 'textquery' parameter into the request URL manually:</p>
<pre><code>import urlparse
from urllib import urlencode

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        url_parts = list(urlparse.urlparse(url))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update(params)
        url_parts[4] = urlencode(query)
        url = urlparse.urlunparse(url_parts)
        return url

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        for gene in genes:
            item = {}
            item['genID'] = map(unicode.strip, gene.xpath('td[1]/a/text()').extract())
            # ...
            yield item

        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)
</code></pre>
<p>Details of the update_url() function come from Lukasz's solution:<br/>
<a href="https://stackoverflow.com/questions/2506379/add-params-to-given-url-in-python">Add params to given URL in Python</a></p>