In Python, scraping with Scrapy returns only the first record

import scrapy


class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    allowed_domains = ['https://www.hamburg.de']
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']
    custom_settings = {
        'FEED_EXPORT_FORMAT': 'utf-8'
    }

    def parse(self, response):
        items = response.xpath("//div[starts-with(@class, 'item')]")
        for item in items:
            business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
            address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
            address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
            phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()
            yield {
                'Business Name': business_name,
                'Address1': address1,
                'Address2': address2,
                'Phone Number': phone
            }

        next_page_url = 'https://www.hamburg.de' + response.xpath("//li[@class='next']/a/@href").get()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

next_page_url = response.xpath("//li[@class='next']/@href").get()
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
    yield scrapy.Request(url=next_page_url, callback=self.parse)

{'Business Name': ' A & Z Kfz Meisterbetrieb GmbH ', 'Address1': ' Anckelmannstraße 13', 'Address2': ' 20537 Hamburg (Borgfelde) ', 'Phone Number': '040 / 236 882 10 '}
2020-11-10 19:55:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/>
{'Business Name': ' A+B Automobile ', 'Address1': ' Kuehnstraße 19', 'Address2': ' 22045 Hamburg (Tonndorf) ', 'Phone Number': '040 / 696 488-0 '}
2020-11-10 19:55:10 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.hamburg.de': <GET https://www.hamburg.de/branchenbuch/hamburg/10239785/n20/>
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Closing spider (finished)
2020-11-10 19:55:10 [scrapy.extensions.feedexport] INFO: Stored json feed (20 items) in: output.json
2020-11-10 19:55:10 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 247,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 50773,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.222001,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 11, 10, 17, 55, 10, 908399),
 'item_scraped_count': 20,
 'log_count/DEBUG': 22,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 11, 10, 17, 55, 8, 686398)}
2020-11-10 19:55:10 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer

User

#1 · Posted on 2024-06-23 19:39:47

The problem is here:

item.xpath("//h3[@class='h3rb']/text()").get()

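When querying from a nested selector in Scrapy, the XPath has to start with ".//" rather than "//". A path that starts with "//" is evaluated against the whole document, so every item returns the first match on the page. Try changing the code as follows: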

business_name = item.xpath(".//h3[@class='h3rb']/text()").get()
address1 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[1]").get()
address2 = item.xpath(".//div[@class='address']/p[@class='extra post']/text()[2]").get()
phone = item.xpath(".//div[@class='address']/span[@class='extra phone']/text()").get()

Hope this works for you.
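
One more thing that may be worth checking: the log in the question contains "Filtered offsite request to 'www.hamburg.de'" for the n20 page, which suggests the next-page request was dropped before it was sent. Scrapy's allowed_domains setting is documented as a list of domain names, while the spider in the question puts a full URL with scheme there. The sketch below, which uses only the standard library, shows one way to get the bare host name from the start URL; `domain_of` is a hypothetical helper, and whether this alone fixes the pagination is an assumption.

```python
from urllib.parse import urlparse

# Start URL copied from the question's spider.
START_URL = 'https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/'


def domain_of(url):
    """Return the host name of a URL, without scheme or path (hypothetical helper)."""
    return urlparse(url).hostname


# The question's spider uses ['https://www.hamburg.de'], a full URL.
# Assumption: a bare host name is what allowed_domains should hold.
allowed_domains = [domain_of(START_URL)]
print(allowed_domains)  # ['www.hamburg.de']
```

In the spider itself this would amount to writing allowed_domains = ['www.hamburg.de'] by hand.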
