How do I implement a custom proxy in Scrapy?

Posted 2024-09-27 19:35:44


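I am trying to route my Scrapy requests through ScraperAPI as a custom proxy, but I think I am doing something wrong, even though I followed their documentation to set everything up. Here is the documentation link.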

from scrapy import Spider
from scrapy.http import Request
from .config import API
from scraper_api import ScraperAPIClient
client = ScraperAPIClient(API)



class GlassSpider(Spider):
    name = 'glass'
    allowed_domains = ['glassdoor.co.uk']
    start_urls = [client.scrapyGet(url='https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1')]
   

    def parse(self, response):
        jobs = response.xpath('//*[contains(@class, "react-job-listing")]')
        for job in jobs:
            job_url = job.xpath('.//*[contains(@class, "jobInfoItem jobTitle")]/@href').extract_first()
            absulate_job_url = response.urljoin(job_url)

            yield Request(client.scrapyGet(url=absulate_job_url),
                           callback=self.parse_jobpage,
                           meta={
                               "Job URL": absulate_job_url
                        })

    def parse_jobpage(self, response): 
        absulate_job_url = response.meta.get('Job URL')
        job_description = "".join(line for line in response.xpath('//*[contains(@class, "desc")]//text()').extract())

        yield {
            "Job URL": absulate_job_url,   
            "Job Description": job_description
        }

This is the output I get. What is wrong with my code? Please help me fix it so that I can understand the problem. Thanks, everyone.

2020-10-01 23:01:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.scraperapi.com/?url=https%3A%2F%2Fwww.glassdoor.co.uk%2FJob%2Frussian-jobs-SRCH_KE0%2C7.htm%3FfromAge%3D1&api_key=bec9dd9f2be095dfc6158a7e609&scraper_sdk=python> (referer: None)
2020-10-01 23:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>


1 answer
User
Answer #1 · posted 2024-09-27 19:35:44

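I am not familiar with this particular library, but judging from your execution log, the problem is that your request is being filtered because Scrapy considers it offsite.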

[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'api.scraperapi.com': <GET https://api.scraperapi.com/?url=https%3A%2F%2Fapi.scraperapi.com%2Fpartner%2FjobListing.htm%3Fpos%3D101%26ao%3D1044074%26s%3D149%26guid%3D00000174e51ccd8988e2e5420e67cf0d%26src%3DGD_JOB_AD%26t%3DSRFJ%26vt%3Dw%26cs%3D1_94f59ee8%26cb%3D1601571704401%26jobListingId%3D3696480795&api_key=bec9d9f82b0955c615c8a7e639scraper_sdk=python>

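ScraperAPI sends your requests through its own domain, api.scraperapi.com, which falls outside what you defined in allowed_domains, so the offsite middleware drops the request. To avoid this, you can remove this line entirely: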

allowed_domains = ['glassdoor.co.uk'] 

Or try adding 'api.scraperapi.com' to that list.
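
To make the two options concrete, here is a minimal standard-library sketch of the kind of host check that allowed_domains implies. It is a simplified model for illustration, not Scrapy's actual middleware code. The helper names is_offsite and proxied, the YOUR_API_KEY placeholder, and the shortened relative_href are all hypothetical; the proxy URL shape is copied from the log above.

```python
from urllib.parse import urlencode, urljoin, urlparse


def is_offsite(url, allowed_domains):
    """Simplified model of an allowed_domains check (hypothetical helper,
    not Scrapy's real implementation)."""
    if not allowed_domains:
        # With no allowed_domains configured, nothing is filtered.
        return False
    host = urlparse(url).hostname or ""
    # A host passes if it equals an allowed domain or is a subdomain of one.
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)


def proxied(target_url, api_key="YOUR_API_KEY"):
    """Build a proxy URL in the shape seen in the log (api_key is a placeholder)."""
    query = urlencode({"url": target_url, "api_key": api_key})
    return "https://api.scraperapi.com/?" + query


request_url = proxied(
    "https://www.glassdoor.co.uk/Job/russian-jobs-SRCH_KE0,7.htm?fromAge=1"
)

# Original setting: the proxy host is not under glassdoor.co.uk, so it is filtered.
print(is_offsite(request_url, ["glassdoor.co.uk"]))  # True

# Option 2 from the answer: add the proxy host to the list.
print(is_offsite(request_url, ["glassdoor.co.uk", "api.scraperapi.com"]))  # False

# Option 1 from the answer: no allowed_domains at all.
print(is_offsite(request_url, []))  # False

# Assumed, shortened version of the relative job link found on the page.
relative_href = "/partner/jobListing.htm?pos=101"

# Joining against the proxied URL yields a link on the proxy host.
print(urljoin(request_url, relative_href))
# https://api.scraperapi.com/partner/jobListing.htm?pos=101

# Joining against the real site yields the intended job link.
print(urljoin("https://www.glassdoor.co.uk", relative_href))
# https://www.glassdoor.co.uk/partner/jobListing.htm?pos=101
```

The last two lines point at a second thing worth checking. The filtered request in the log wraps https://api.scraperapi.com/partner/jobListing.htm rather than a glassdoor.co.uk address, which suggests that response.urljoin resolved the relative link against the proxy URL. If that is the case, joining the link against the original site URL before passing it to client.scrapyGet may be needed in addition to the allowed_domains change.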
