Scrapy is not parsing items

Posted 2024-09-29 00:18:28


I'm trying to scrape a web page with Scrapy, but the callback is not parsing any items. Any help would be appreciated. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from ..items import EscrotsItem

class Escorts(scrapy.Spider):
    name = 'escorts'
    allowed_domains = ['www.escortsandbabes.com.au']
    start_urls = ['https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/']

    def parse_links(self, response):
        for i in response.css('.btn.btn-default.btn-block::attr(href)').extract()[2:]:
            yield scrapy.Request(url=response.urljoin(i),callback=self.parse)
        NextPage = response.css('.page.next-page::attr(href)').extract_first()
        if NextPage:
            yield scrapy.Request(
                url=response.urljoin(NextPage),
                callback=self.parse_links)

    def parse(self, response):
        for x in response.xpath('//div[@class="advertiser-profile"]'):
            item = EscrotsItem()
            item['Name'] = x.css('.advertiser-names--display-name::text').extract_first()
            item['Username'] = x.css('.advertiser-names--username::text').extract_first()
            item['Phone'] = x.css('.contact-number::text').extract_first()
            yield item

1 answer
User
#1 · Posted 2024-09-29 00:18:28

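Your spider requests the URL in start_urls and, because no callback is specified, Scrapy passes the response to the parse function. That listing page contains no div.advertiser-profile elements, so parse yields nothing and the spider closes without any results. The parse_links function is never called at all.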

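Swap the two function names, so that parse handles the listing page and parse_links handles each profile page: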

import scrapy


class Escorts(scrapy.Spider):
    name = 'escorts'
    allowed_domains = ['escortsandbabes.com.au']
    start_urls = ['https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/']

    def parse(self, response):
        # Listing page: follow each profile button, skipping the first two matches.
        for i in response.css('.btn.btn-default.btn-block::attr(href)').extract()[2:]:
            yield scrapy.Request(response.urljoin(i), self.parse_links)
        next_page = response.css('.page.next-page::attr(href)').get()
        if next_page:
            # No callback given, so the next listing page goes to parse again.
            yield scrapy.Request(response.urljoin(next_page))

    def parse_links(self, response):
        # Profile page: one item per advertiser-profile block.
        for x in response.xpath('//div[@class="advertiser-profile"]'):
            item = {}
            # Class names keep the "--" from the question's selectors.
            item['Name'] = x.css('.advertiser-names--display-name::text').get()
            item['Username'] = x.css('.advertiser-names--username::text').get()
            item['Phone'] = x.css('.contact-number::text').get()
            yield item
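
The spider above also sets allowed_domains to 'escortsandbabes.com.au', while the question used 'www.escortsandbabes.com.au' with start URLs on the bare host. As far as I understand Scrapy's offsite filtering, a follow-up request is only allowed when its host equals an allowed domain or is a subdomain of one, so the "www." entry would likely cause the profile and next-page requests to be dropped. The sketch below is a simplified model of that check using only the standard library; host_allowed is a hypothetical helper for illustration, not part of Scrapy's API.

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Simplified model of an offsite check (hypothetical helper, not Scrapy's API).

    A host passes if it equals an allowed domain or is a subdomain of one.
    """
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.search(pattern, host) is not None


start_url = "https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/"

# Question's setting: the bare host is neither "www...." nor a subdomain of it.
question_setting = host_allowed(start_url, ["www.escortsandbabes.com.au"])
# Answer's setting: the bare domain matches itself.
answer_setting = host_allowed(start_url, ["escortsandbabes.com.au"])

print(question_setting)  # False
print(answer_setting)    # True
```

Listing the bare domain is the safer choice here, since it covers both the bare host and any "www." variant.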

My log from the scrapy shell:

In [1]: fetch("https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/")
2019-03-29 15:22:56 [scrapy.core.engine] INFO: Spider opened
2019-03-29 15:23:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/> (referer: None, latency: 2.48 s)

In [2]: response.css('.page.next-page::attr(href)').get()
Out[2]: u'/Directory/ACT/Canberra/2600/Any/All/?p=2'
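
The selector returns a relative href, which is why the spider wraps it in response.urljoin before building the next request. A minimal sketch of that resolution step, using the standard library's urljoin as a stand-in and assuming the page has no <base> tag that would change the base URL:

```python
from urllib.parse import urljoin

# URL that was fetched in In [1] of the shell log.
page_url = "https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/"
# Relative href returned in Out[2] of the shell log.
next_href = "/Directory/ACT/Canberra/2600/Any/All/?p=2"

# Resolve the relative href against the page URL to get an absolute URL.
next_url = urljoin(page_url, next_href)
print(next_url)
# https://escortsandbabes.com.au/Directory/ACT/Canberra/2600/Any/All/?p=2
```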
