How can I use Scrapy to crawl the same URL by posting different data?

Posted 2024-10-03 13:18:44


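I want to crawl a website by posting a different page number for each page, but I only get the data for the first page and then the spider finishes. I suspect that, because the URL is the same every time, Scrapy is filtering the requests out.
Here is my code: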

class ZhejiangCrawl(Spider):
    name = 'ZhejiangCrawl'
    root_url= 'http://www.zjsfgkw.cn/Execute/CreditCompany'
    start_page = 1
    current_page = start_page
    end_page = 24974
    post_data = {'PageNo': str(current_page), 'PageSize': '5', 'ReallyName': '', 'CredentialsNumber': '', 'AH': '',
                      'ZXFY': '', 'StartLARQ': '','EndLARQ':''}
    headers = HEADER
    cookies = COOKIES

    def start_requests(self):
        return [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies, formdata=self.post_data, dont_filter=True,
                        callback=self.parse)]

    def parse(self, response):
        if self.current_page < self.end_page:
            self.current_page += 1
            self.post_data['PageNo'] = str(self.current_page)
            yield [FormRequest(self.root_url, headers=self.headers, cookies=self.cookies, dont_filter=True,
                           formdata=self.post_data, callback=self.parse)]

        jsonstr = json.loads(response.body)
        for item_dict in jsonstr['informationmodels']:
            item = ZhejiangcrawlItem()
            item['name'] = item_dict['ReallyName']
            item['cardNum'] = item_dict['CredentialsNumber']
            item['performance'] = item_dict['ZXJE']
            item['unperformance'] = item_dict['WZXJE']
            item['gistUnit'] = item_dict['ZXFY']
            item['address'] = item_dict['Address']
            item['gistId'] = item_dict['ZXYJ']
            item['caseCode'] = item_dict['AH']
            item['regDate'] = item_dict['LARQ']
            item['exposureDate'] = item_dict['BGRQ']
            item['gistReason'] = item_dict['ZXAY']
            yield item

How can I fix this?


1 answer
User
#1 · Posted 2024-10-03 13:18:44

If you think the requests are being dropped by the DupeFilter, add dont_filter=True to the FormRequests.

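Also note that there is no reason to wrap what you yield or return in a list. In parse, yield [FormRequest(...)] hands Scrapy a list rather than a Request, so the request for the next page is never scheduled and the spider stops after the first page. Yield the FormRequest itself.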
