How can I pass cookies before crawling the first page, when I don't know what the cookies are yet?

Published 2024-09-28 22:40:51


I want to scrape data from multiple pages. To get the data on the second page I apparently have to pass the search term via cookies, because the term does not appear in the second page's URL.
The URL of the first page is:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky

The URL of the second page is:

http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10
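Comparing the query strings of the two URLs shows the problem directly: the search term exists only on page 1, so page 2 must get it from somewhere else. A quick standard-library check (using just the two URLs above):

```python
from urllib.parse import parse_qs, urlparse

page1 = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"
page2 = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10"

# parse_qs decodes '+' as a space and returns each parameter as a list
print(parse_qs(urlparse(page1).query))  # {'textquery': ['man'], 'submit': ['Feeling Lucky']}
print(parse_qs(urlparse(page2).query))  # {'currentIndex': ['10']}
```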

I have seen many questions like this on Stack Overflow, but in all of them the asker already knows what the cookie data is before crawling. In my case I only get the cookies after crawling the first page, so how should I handle this? Here is my code:

__author__ = 'Rabbit'
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    url_base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"

    start_urls = [url_base]  # start from the search-result page

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            # Python 3: strip each extracted string
            # (replaces the Python 2 idiom map(unicode.strip, ...))
            item['genID'] = [s.strip() for s in site.xpath('td[1]/a/text()').extract()]
            item['taxID'] = [s.strip() for s in site.xpath('td[2]/a/text()').extract()]
            item['familyID'] = [s.strip() for s in site.xpath('td[3]/a/text()').extract()]
            item['chromosome'] = [s.strip() for s in site.xpath('td[4]/text()').extract()]
            item['symbol'] = [s.strip() for s in site.xpath('td[5]/text()').extract()]
            item['description'] = [s.strip() for s in site.xpath('td[6]/text()').extract()]
            yield item

2 answers

I just noticed that you posted the same question here that you already asked in this post, which I answered yesterday. So I am posting my answer here again and leaving it to the moderators to decide...

Your example works fine for me once link extraction and request generation are added to the parse() function, so the page probably sets some server-side cookies. It fails, however, when using a proxy service such as Scrapy's Crawlera (which downloads from multiple IPs).

The solution is to put the 'textquery' parameter into the request URL manually:

from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        url_parts = list(urlparse(url))
        query = dict(parse_qsl(url_parts[4]))  # url_parts[4] is the query string
        query.update(params)
        url_parts[4] = urlencode(query)
        return urlunparse(url_parts)

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for gene in genes:
            item = {}
            item['genID'] = [s.strip() for s in gene.xpath('td[1]/a/text()').extract()]
            # ...
            yield item

        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)

Details of the update_url() function come from Lukasz's solution:
Add params to given URL in Python
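Shown standalone, the same URL-merging technique looks like this in Python 3 (urlparse and urlencode now live in urllib.parse; the logic is unchanged):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def update_url(url, params):
    parts = list(urlparse(url))
    query = dict(parse_qsl(parts[4]))  # parts[4] is the query string
    query.update(params)               # add or overwrite parameters
    parts[4] = urlencode(query)
    return urlunparse(parts)

url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10"
print(update_url(url, {"textquery": "calb"}))
# http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10&textquery=calb
```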

Scrapy receives and keeps track of the cookies sent by servers, and sends them back on subsequent requests, just like any regular web browser does; see the documentation here for more information.
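That browser-like behaviour can be demonstrated with just the standard library. The toy server below (not the EPGD site) stores the "search term" in a cookie, and the client's cookie jar resends it automatically on the next request, which is exactly what Scrapy's cookie middleware does between consecutive Requests:

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/search"):
            # First page: the server remembers the term via a cookie.
            self.send_response(200)
            self.send_header("Set-Cookie", "term=man")
            self.end_headers()
            self.wfile.write(b"page 1")
        else:
            # Second page: echo back whatever Cookie header arrived.
            term = self.headers.get("Cookie", "no cookie")
            self.send_response(200)
            self.end_headers()
            self.wfile.write(term.encode())

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# A cookie jar makes urllib behave like a browser: cookies received
# on one request are sent back on the next one.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

opener.open(f"{base}/search?textquery=man").read()
body = opener.open(f"{base}/page2?currentIndex=10").read().decode()
print(body)  # term=man  (the cookie from request 1 was sent with request 2)
server.shutdown()
```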

I can't see where your code handles the pagination, but it should look something like this:

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = [s.strip() for s in site.xpath('td[1]/a/text()').extract()]
            item['taxID'] = [s.strip() for s in site.xpath('td[2]/a/text()').extract()]
            item['familyID'] = [s.strip() for s in site.xpath('td[3]/a/text()').extract()]
            item['chromosome'] = [s.strip() for s in site.xpath('td[4]/text()').extract()]
            item['symbol'] = [s.strip() for s in site.xpath('td[5]/text()').extract()]
            item['description'] = [s.strip() for s in site.xpath('td[6]/text()').extract()]
            yield item

        yield Request('http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10',
                      callback=self.parse_second_url)

    def parse_second_url(self, response):
        # do your thing
        pass

The second request carries the cookies from the first request.
