Scraping infinite results served via JSON ("View more")

Posted 2024-09-23 04:26:51


My goal is to scrape this URL.

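Each item in the list links to a page with more information about it. My goal is to scrape all 17,000 linked pages. Only 10 results are shown at first, and the "View more" button issues a request that appends 10 more results to the list via JSON. I tried modifying the request by changing batchsize, the parameter that defines how many results the list contains, but that did not work. I also tried using this code (from a tutorial), but could not adapt it to my specific task: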

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

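I have looked at several examples (here, here, and here). However, after two days of trying I still cannot work out how to solve this, because the URL requests on the site I want to scrape differ from those in all the examples, and they seem to make scraping harder.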

Clicking "View more" issues the following request:

Request URL: https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p=2&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a

The p= parameter is incremented each time "View more" is clicked.
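A minimal sketch of that paging behaviour, assuming `p` is the only parameter that has to change between requests (this has not been verified against the site). The helper name `page_url` is hypothetical; the URL is the one captured above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL captured from the browser's network tab (copied from the request above).
CAPTURED_URL = (
    "https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname"
    "&cs=false&location=&p=2&q=&s=name&sortorder=name"
    "&st=4af2ed43-1154-4363-ae6b-718f9b84d23a"
)


def page_url(page, captured_url=CAPTURED_URL):
    """Return the captured URL with its `p` query parameter set to `page`."""
    parts = urlsplit(captured_url)
    # keep_blank_values=True preserves empty parameters such as `caretype=`.
    # Note: the bare `componentname` is re-emitted as `componentname=`.
    query = [
        (key, str(page) if key == "p" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


print(page_url(3))
```

Rebuilding the query string rather than using string formatting keeps every other parameter exactly as the browser sent it, apart from the `componentname=` normalisation noted in the comment.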

The returned JSON has the following format:

{"Heading":"17952 träffar på Alla mottagningar","Query":"","Region":null,"NextPage":3,"Page":2,"BatchSize":10,"BatchText":"Visa 10 till","TotalHits":17952,"SortOrder":"name","Latitude":0.0,"Longitude":0.0,"Bounds":null,"SearchHits":[{"HsaId":"SE162321000255-O23228","FriendlyUrl":"/hitta-vard/kontaktkort/A5-Psykoterapi-Katia-Karlsson-Carli-AB-Lund/","DisplayName":"A5 Psykoterapi Katia Karlsson Carli AB, Lund","Address":"Stortorget 1, Lund","PhoneNumber":"073-046 26 68","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE162321000255-O22542","FriendlyUrl":"/hitta-vard/kontaktkort/A5Psykoterapi-Gunilla-Lundqvist-Lund/","DisplayName":"A5Psykoterapi - Gunilla Lundqvist, Lund","Address":"Stortorget 1 5:e vån, Lund","PhoneNumber":"070-624 13 97","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE2321000057-6SV4","FriendlyUrl":"/hitta-vard/kontaktkort/A6-Ogonklinik-AB/","DisplayName":"A6 Ögonklinik AB","Address":"Batterigatan 9 NB, Jönköping","PhoneNumber":"036-860 20 30","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":57.768032303027383,"Longitude":14.202798620555548},{"HsaId":"SE162321000024-0059892","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Evelina-Linder-KBT/","DisplayName":"AB Evelina Linder KBT","Address":"Drottninggatan 1A, Uppsala","PhoneNumber":"073-593 00 73","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.858328320441558,"Longitude":17.638292776307694},{"HsaId":"SE162321000024-0052597","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Forsberg-KBT-konsult/","DisplayName":"AB Forsberg KBT-konsult","Address":"Trädgårdsgatan 5A, Uppsala","PhoneNumber":"070-818 17 
11","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.856845411620185,"Longitude":17.635819529969204},{"HsaId":"SE2321000016-C7H4","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Lyhord-Ostermalmstorg/","DisplayName":"AB Lyhörd - Östermalmstorg","Address":"Östermalmstorg 1,STOCKHOLM","PhoneNumber":"08-425 004 00","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.336237708592563,"Longitude":18.079317099784653},{"HsaId":"SE2321000016-BH0B","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Suavis-horsel-Solna-Business-park/","DisplayName":"AB Suavis hörsel, Solna Business park","Address":"Svetsarvägen 15,2 tr,SOLNA","PhoneNumber":"010-207 11 77","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.35928477168008,"Longitude":17.980058512140353},{"HsaId":"SE2321000016-56DM","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Annette-Goransson/","DisplayName":"AB Vackra Tänder Annette Göransson","Address":"Drottninggatan 71A,STOCKHOLM","PhoneNumber":"08-21 52 62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592153903674,"Longitude":18.059258535271329},{"HsaId":"SE5564844115-106Q","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Norrmalm/","DisplayName":"AB Vackra Tänder, Norrmalm","Address":"Drottninggatan 71 A, 3 tr,","PhoneNumber":"08-21 52 62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592396728109,"Longitude":18.059118082991937},{"HsaId":"SE2321000016-97P2","FriendlyUrl":"/hitta-vard/kontaktkort/ABA-Ogonklinik-i-Alvik/","DisplayName":"ABA Ögonklinik i Alvik","Address":"Tranebergsplan 3,,BROMMA","PhoneNumber":"08-124 440 10","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33516807973394,"Longitude":17.978288641135208}],"HasZeroHits":false}

I would appreciate a few initial lines of code to get me started.


2 Answers

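This code may or may not work, but it is the approach I would take given the problem you are facing. You can put {} in the start URL and fill it in with format. Also, when you loop over data['quotes'], you are now dealing with a JSON object rather than a Scrapy selector, so there is no need to call .get().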

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    start_urls = ['https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p={}&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a']

    def start_requests(self):
        # You may also need to replicate the headers used in the requests made to this URL.
        yield scrapy.Request(self.start_urls[0].format('1'))

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['quotes']:
            # remember you're no longer dealing with a scrapy selector but now a json object
            yield {
                'text': item['text'],
                'author': item['name'],
                'tags': item['tags'],
            }
        if data['has_next']:
            # convert to integer to do addition
            next_page = int(data['page']) + 1
            yield scrapy.Request(self.start_urls[0].format(next_page), callback=self.parse)
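The field names in the code above come from the tutorial's API. Assuming the 1177.se response has the shape shown in the question (`SearchHits`, `FriendlyUrl`, `NextPage`), the parsing step might look roughly like this. `extract_links` and `BASE_URL` are hypothetical names, and treating a missing or null `NextPage` as the end of the results is an assumption.

```python
import json
from urllib.parse import urljoin

# Assumed: FriendlyUrl values are paths relative to the site root.
BASE_URL = "https://www.1177.se"


def extract_links(body, base_url=BASE_URL):
    """Return (detail_urls, next_page) from one JSON search response."""
    data = json.loads(body)
    detail_urls = [
        urljoin(base_url, hit["FriendlyUrl"])
        for hit in data.get("SearchHits", [])
        if hit.get("FriendlyUrl")
    ]
    # NextPage is 3 in the sample response for page 2; a missing or null
    # value is treated here as "no more pages" (an assumption).
    return detail_urls, data.get("NextPage")


# Trimmed-down sample built from the response shown in the question.
sample = json.dumps({
    "NextPage": 3,
    "Page": 2,
    "SearchHits": [
        {"FriendlyUrl": "/hitta-vard/kontaktkort/A6-Ogonklinik-AB/",
         "DisplayName": "A6 Ögonklinik AB"},
    ],
})
print(extract_links(sample))
```

Inside a spider's parse method, each returned URL could be passed to scrapy.Request with its own callback, and the next page requested only while the second value is not None.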

This should do it:

import json
import scrapy

Headerz = {
    'accept': 'text/html, */*; q=0.01',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'pragma': 'no-cache',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}

class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    start_urls = ['https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p={}&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a']

    def start_requests(self):
        # You may also need to replicate the headers used in the requests made to this URL.
        yield scrapy.Request(self.start_urls[0].format('1'), headers=Headerz)

    def parse(self, response):
        data = json.loads(response.body)
        # The parsed JSON is now in `data`; extract whatever you need from it here.

        # Paginate: NextPage holds the number of the next page to request.
        # .get() returns None if the key is missing, so no try/except is needed.
        nextpage_number = data.get('NextPage')
        if nextpage_number is not None:
            nexturl = self.start_urls[0].format(nextpage_number)
            yield scrapy.Request(nexturl, headers=Headerz)

The trick here is to use the right headers.
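If every request needs the same headers, one option is to set them once in the Scrapy settings instead of passing them to each Request. The fragment below is a sketch for a project's settings.py (the same keys can also go in a spider's custom_settings dict). It assumes the header values from the answer above are what the site expects, which has not been verified here.

```python
# settings.py fragment: headers applied to every request Scrapy sends.
# Headers set explicitly on an individual Request still take precedence.
DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html, */*; q=0.01',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
}

# The user agent has its own setting in Scrapy.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
)

# Same delay as the tutorial spider in the question; with well over a
# thousand result pages to fetch, some delay between requests is advisable.
DOWNLOAD_DELAY = 1.5
```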
