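My goal is to scrape this URL.
Each item in the list links to a page with more information about it. My goal is to scrape all 17,000 linked pages. Only 10 results are displayed, and the "show more" button sends a request that adds 10 more results to the list via JSON. I tried modifying the request by changing batchsize, the parameter that defines the number of results in the list, but that did not work. I also tried this code (from a tutorial), but could not adapt it to my specific task: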
import json

import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)
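I have looked at some examples here, here, and here. However, after two days of trying, I still cannot figure out how to solve this, because the request URLs on the site I want to scrape differ from those in all the examples and seem to make scraping more difficult.
The request made by clicking "show more" looks like this:
The returned JSON has the following format: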
{"Heading":"17952 träffar på Alla mottagningar","Query":"","Region":null,"NextPage":3,"Page":2,"BatchSize":10,"BatchText":"Visa 10 till","TotalHits":17952,"SortOrder":"name","Latitude":0.0,"Longitude":0.0,"Bounds":null,"SearchHits":[{"HsaId":"SE162321000255-O23228","FriendlyUrl":"/hitta-vard/kontaktkort/A5-Psykoterapi-Katia-Karlsson-Carli-AB-Lund/","DisplayName":"A5 Psykoterapi Katia Karlsson Carli AB, Lund","Address":"Stortorget 1, Lund","PhoneNumber":"073-046 26 68","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE162321000255-O22542","FriendlyUrl":"/hitta-vard/kontaktkort/A5Psykoterapi-Gunilla-Lundqvist-Lund/","DisplayName":"A5Psykoterapi - Gunilla Lundqvist, Lund","Address":"Stortorget 1 5:e vån, Lund","PhoneNumber":"070-624 13 97","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE2321000057-6SV4","FriendlyUrl":"/hitta-vard/kontaktkort/A6-Ogonklinik-AB/","DisplayName":"A6 Ögonklinik AB","Address":"Batterigatan 9 NB, Jönköping","PhoneNumber":"036-860 20 30","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":57.768032303027383,"Longitude":14.202798620555548},{"HsaId":"SE162321000024-0059892","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Evelina-Linder-KBT/","DisplayName":"AB Evelina Linder KBT","Address":"Drottninggatan 1A, Uppsala","PhoneNumber":"073-593 00 73","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.858328320441558,"Longitude":17.638292776307694},{"HsaId":"SE162321000024-0052597","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Forsberg-KBT-konsult/","DisplayName":"AB Forsberg KBT-konsult","Address":"Trädgårdsgatan 5A, Uppsala","PhoneNumber":"070-818 17 11","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.856845411620185,"Longitude":17.635819529969204},{"HsaId":"SE2321000016-C7H4","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Lyhord-Ostermalmstorg/","DisplayName":"AB Lyhörd - Östermalmstorg","Address":"Östermalmstorg 1,STOCKHOLM","PhoneNumber":"08-425 004 00","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.336237708592563,"Longitude":18.079317099784653},{"HsaId":"SE2321000016-BH0B","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Suavis-horsel-Solna-Business-park/","DisplayName":"AB Suavis hörsel, Solna Business park","Address":"Svetsarvägen 15,2 tr,SOLNA","PhoneNumber":"010-207 11 77","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.35928477168008,"Longitude":17.980058512140353},{"HsaId":"SE2321000016-56DM","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Annette-Goransson/","DisplayName":"AB Vackra Tänder Annette Göransson","Address":"Drottninggatan 71A,STOCKHOLM","PhoneNumber":"08-21 52 62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592153903674,"Longitude":18.059258535271329},{"HsaId":"SE5564844115-106Q","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Norrmalm/","DisplayName":"AB Vackra Tänder, Norrmalm","Address":"Drottninggatan 71 A, 3 tr,","PhoneNumber":"08-21 52 62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592396728109,"Longitude":18.059118082991937},{"HsaId":"SE2321000016-97P2","FriendlyUrl":"/hitta-vard/kontaktkort/ABA-Ogonklinik-i-Alvik/","DisplayName":"ABA Ögonklinik i Alvik","Address":"Tranebergsplan 3,,BROMMA","PhoneNumber":"08-124 440 10","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33516807973394,"Longitude":17.978288641135208}],"HasZeroHits":false}
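I would appreciate a few initial lines of code to get me started.
This code may or may not work, but given the problem you are facing, this is the approach I would take. You can put {} in the start URL and fill it in with format. Also, when you loop over data['quotes'], you are now working with a JSON object, not a Scrapy selector, so there is no need to call .get().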
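A minimal sketch of that idea, using only the standard library so it runs without Scrapy. The field names `SearchHits`, `FriendlyUrl`, and `NextPage` come from the JSON in the question; `BASE_URL`, the endpoint path, and the query parameter names in `SEARCH_URL` are hypothetical placeholders, since the real request is not shown. How the server signals the last page is also an assumption here.

```python
import json
from urllib.parse import urljoin

# Hypothetical host and endpoint: replace with the real ones from the
# browser's network tab. The {} is filled in with the page number.
BASE_URL = "https://example.com"
SEARCH_URL = BASE_URL + "/api/search?page={}&batchsize=10"


def parse_batch(body):
    """Return (detail_urls, next_page) for one JSON batch of results."""
    data = json.loads(body)
    hits = data["SearchHits"]  # a plain list of dicts, no selector involved
    detail_urls = [urljoin(BASE_URL, hit["FriendlyUrl"]) for hit in hits]
    # Assumption: paging ends when a batch is empty or NextPage is missing/null.
    next_page = data.get("NextPage") if hits else None
    return detail_urls, next_page


# One hit taken from the JSON in the question, trimmed to the fields used here.
sample = json.dumps({
    "NextPage": 3,
    "Page": 2,
    "SearchHits": [{"FriendlyUrl": "/hitta-vard/kontaktkort/A6-Ogonklinik-AB/"}],
})
urls, next_page = parse_batch(sample)
print(urls)
print(SEARCH_URL.format(next_page))
```

In a spider, each entry of `detail_urls` would become a request for a detail page, and `SEARCH_URL.format(next_page)` would become the request for the next batch.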
This should do it:
The trick here is to use the right headers.
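The headers the answer used are not preserved here, so the following is only a sketch of the technique. It assumes the endpoint expects the headers a browser typically sends with an AJAX call (`Accept` and `X-Requested-With`); the actual set should be copied from the "show more" request in the browser's network tab. The URL is a hypothetical placeholder, and the request is built but not sent.

```python
import urllib.request

# Assumed headers, typical for AJAX endpoints; verify against the real request.
AJAX_HEADERS = {
    "Accept": "application/json",
    "X-Requested-With": "XMLHttpRequest",
}


def build_request(url, headers=AJAX_HEADERS):
    """Build a request object that carries the AJAX headers (does not send it)."""
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/api/search?page=2")  # hypothetical URL
print(req.full_url)
print(req.header_items())
```

In a Scrapy spider the same dictionary can be passed through the `headers` argument of `scrapy.Request`.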