Simple Scrapy spider not following links and crawling

Basically, the problem is with following the links.

I start from pages 1, 2, 3, 4, 5 ... 90.

Each page has around 100 links.

Each page is in this format:

http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4
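
Incidentally, since the listing pages follow one predictable URL pattern, they could also be enumerated up front instead of being discovered through link extraction. A minimal sketch, assuming the pages run from 1 to 90 as described above:

# Hypothetical alternative: enumerate the listing pages directly,
# assuming pages 1 through 90 as described above
start_urls = [
    "http://www.consumercomplaints.in/lastcompanieslist/page/%d" % page
    for page in range(1, 91)
]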

This is the regex matching rule:

(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)
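
The pattern can be sanity-checked outside Scrapy with the standard re module; a quick standalone check (plain re, not the LinkExtractor itself), using the same allow pattern as the Rule in the spider below:

import re

# The same allow pattern that is passed to LinkExtractor in the spider below
pattern = re.compile(r'(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)')

for url in [
    "http://www.consumercomplaints.in/lastcompanieslist/page/1",
    "http://www.consumercomplaints.in/lastcompanieslist/page/90",
]:
    # should print True for every listing page, so the Rule ought to match them
    print(pattern.search(url) is not None)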

I go to each page, and then create a Request object to scrape all the links on each page.

Scrapy crawls only 179 links in total each time and then gives a finished status.

What am I doing wrong?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse

class consumercomplaints_spider(CrawlSpider):
    name = "test_complaints"
    allowed_domains = ["www.consumercomplaints.in"]
    protocol='http://'

    start_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/"
    ]

    # These are the rules for matching the domain links using a regular expression; only matched links are crawled
    rules = [
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
    ]


    def parse_data(self, response):
        # Get all the links on the page using an XPath selector
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        # Convert each relative link to an absolute link (/abc.html -> www.domain.com/abc.html) and then send a Request object
        for relative_link in all_page_links:
            print "relative link processed:"+relative_link

            absolute_link = urlparse.urljoin(self.protocol+self.allowed_domains[0],relative_link.strip())
            request = scrapy.Request(absolute_link,
                         callback=self.parse_complaint_page)
            return request


        return {}

    def parse_complaint_page(self, response):
        print "SCRAPED "+response.url
        return {}
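
One thing worth flagging in parse_data above is the return request inside the for loop: return exits the callback on its first iteration, so only one complaint link per listing page ever gets scheduled, which is consistent with the crawl finishing after a small, fixed number of requests. A minimal sketch of the same callback using yield instead, with everything else unchanged:

    def parse_data(self, response):
        # Get all the complaint links on the listing page
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0],
                                             relative_link.strip())
            # yield instead of return: the callback becomes a generator,
            # so every link on the page is scheduled, not just the first
            yield scrapy.Request(absolute_link,
                                 callback=self.parse_complaint_page)

With yield, the trailing return {} is no longer needed; a generator callback simply ends when the loop does.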
