Scrapy: iterating over the list rows and paginating fails

Posted 2024-06-25 23:34:23


My goal is to extract the 25 rows on each page (6 items per row) and then iterate over each of the 40 pages.

Currently, my spider only extracts the first row from pages 1-3 (see the CSV output image).

I assumed the list_iterator() function would iterate over each row; however, there seems to be an error in my rules or in the callback function that prevents all of the rows on each page from being scraped.

Any help or suggestions would be greatly appreciated!

propub_spider.py:

import scrapy 
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from propub.items import PropubItem
from scrapy.http import Request

class propubSpider(CrawlSpider):
    name = 'prop$'
    allowed_domains = ['https://projects.propublica.org']
    max_pages = 40
    start_urls = [
        'https://projects.propublica.org/docdollars/search?state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=2&state%5Bid%5D=33',
        'https://projects.propublica.org/docdollars/search?page=3&state%5Bid%5D=33']

    rules = (Rule(SgmlLinkExtractor(allow=('\\search?page=\\d')), 'parse_start_url', follow=True),)

    def list_iterator(self):
        for i in range(self.max_pages):
            yield Request('https://projects.propublica.org/docdollars/search?page=d' % i, callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//*[@id="payments_list"]/tbody'):
            item = PropubItem()
            item['payee'] = sel.xpath('tr[1]/td[1]/a[2]/text()').extract()
            item['link'] = sel.xpath('tr[1]/td[1]/a[1]/@href').extract()
            item['city'] = sel.xpath('tr[1]/td[2]/text()').extract()
            item['state'] = sel.xpath('tr[1]/td[3]/text()').extract()
            item['company'] = sel.xpath('tr[1]/td[4]').extract()
            item['amount'] =  sel.xpath('tr[1]/td[7]/span/text()').extract()
            yield item 
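
All of the XPaths above address tr[1], i.e. only the first row of the table body. For reference, a row-by-row version of the same extraction might look roughly like the sketch below (sketch only; the spider name is made up and the XPaths are the ones from the code above, made row-relative):

import scrapy
from propub.items import PropubItem

class PropubRowsSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate row iteration.
    name = 'propub_rows'
    start_urls = ['https://projects.propublica.org/docdollars/search?state%5Bid%5D=33']

    def parse(self, response):
        # Select every <tr> under the tbody and use row-relative XPaths,
        # instead of always picking tr[1] (the first row only).
        for row in response.xpath('//*[@id="payments_list"]/tbody/tr'):
            item = PropubItem()
            item['payee'] = row.xpath('td[1]/a[2]/text()').extract()
            item['link'] = row.xpath('td[1]/a[1]/@href').extract()
            item['city'] = row.xpath('td[2]/text()').extract()
            item['state'] = row.xpath('td[3]/text()').extract()
            item['company'] = row.xpath('td[4]').extract()
            item['amount'] = row.xpath('td[7]/span/text()').extract()
            yield item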

pipelines.py:

^{pr2}$
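
The original pipelines.py code block is not reproduced above. For illustration only, a pipeline that writes the scraped items to CSV might look like this (the file name and field list are assumptions, not the asker's actual code):

import csv

class PropubPipeline(object):
    # Minimal CSV-export pipeline sketch; Scrapy's built-in feed exports
    # (e.g. `scrapy crawl ... -o output.csv`) are an alternative.
    fields = ('payee', 'link', 'city', 'state', 'company', 'amount')

    def open_spider(self, spider):
        self.file = open('propub_output.csv', 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(self.fields)

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Each field holds a list returned by .extract(); join to one string.
        self.writer.writerow([' '.join(item.get(f, [])) for f in self.fields])
        return item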

items.py:

import scrapy
from scrapy.item import Item, Field

class PropubItem(scrapy.Item):
    payee = scrapy.Field()
    link = scrapy.Field()
    city = scrapy.Field()
    state = scrapy.Field()
    company = scrapy.Field()
    amount =  scrapy.Field()
    pass

CSV output:

(screenshot of the CSV, showing only the first row extracted from each page)


1 answer
Answer #1 · posted 2024-06-25 23:34:23

Several things need to be fixed:

  • Use the start_requests() method instead of list_iterator()
  • A % sign is missing here:

    yield Request('https://projects.propublica.org/docdollars/search?page=%d' % i, callback=self.parse)
    #                                                                 HERE^

  • You don't need CrawlSpider, since you are providing the pagination links through start_requests(); use a regular scrapy.Spider instead
  • The XPath expressions would be more reliable if they matched the table cells by their class attribute

Fixed version:

^{pr2}$
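
The fixed code block itself is not shown above, but a minimal sketch applying the four fixes from the list (plain scrapy.Spider, start_requests() for pagination, the missing %d, and class-based cell selectors) could look like the following. The @class values in the XPaths and the exact page range are assumptions, not taken from the actual page:

import scrapy
from propub.items import PropubItem

class PropubSpider(scrapy.Spider):
    name = 'propub'
    allowed_domains = ['projects.propublica.org']
    max_pages = 40

    def start_requests(self):
        # One request per results page; %d fills in the page number,
        # %%5B/%%5D keep the literal URL-encoded brackets of state[id].
        for i in range(1, self.max_pages + 1):
            url = ('https://projects.propublica.org/docdollars/search'
                   '?page=%d&state%%5Bid%%5D=33' % i)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Iterate over every row of the payments table, not just tr[1].
        for row in response.xpath('//*[@id="payments_list"]/tbody/tr'):
            item = PropubItem()
            # NOTE: the class names below are placeholders; match them to
            # whatever classes the real table cells carry.
            item['payee'] = row.xpath('td[@class="name"]/a[2]/text()').extract()
            item['link'] = row.xpath('td[@class="name"]/a[1]/@href').extract()
            item['city'] = row.xpath('td[@class="city"]/text()').extract()
            item['state'] = row.xpath('td[@class="state"]/text()').extract()
            item['company'] = row.xpath('td[@class="company"]/text()').extract()
            item['amount'] = row.xpath('td[@class="amount"]/span/text()').extract()
            yield item

Run with something like scrapy crawl propub -o output.csv and each table row should come out as one CSV line.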
