I'm a beginner. In my first project I'm trying to crawl a website with multiple pages. I get data from the first page (index=0), but nothing from the following pages:
I've tried different Rules, but none of them worked for me. Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from ..items import myfirstItem


class myfirstSpider(CrawlSpider):
    name = 'myfirst'
    start_urls = ["https://www.leroymerlin.es/decoracion-navidena/arboles-navidad"]
    allowed_domains = ["leroymerlin.es"]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths='//li[@class="next"]/a')),
        Rule(LinkExtractor(allow=(), restrict_xpaths='//a[@class="boxCard"]'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        items = myfirstItem()
        items['product_name'] = response.css('.titleTechniqueSheet::text').extract()
        yield items
Believe me, I've already read thousands of posts about this same problem, and none of them worked for me. Any help?
*Edit: following @Fura's suggestion, I found a better solution. This is what it looks like:
class myfirstSpider(CrawlSpider):
    name = 'myfirst'
    start_urls = ["https://www.leroymerlin.es/decoracion-navidena/arboles-navidad?index=%s" % page_number
                  for page_number in range(1, 20)]
    allowed_domains = ["leroymerlin.es"]

    rules = (
        Rule(LinkExtractor(allow=r'/fp'), callback='parse_item'),
    )

    def parse_item(self, response):
        items = myfirstItem()
        items['product_name'] = response.css('.titleTechniqueSheet::text').extract()
        yield items
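The key change is that start_urls no longer relies on Rules to discover pagination: the list comprehension pre-generates one URL per results page up front. Expanded on its own it looks like this (the page count of 19 is just the range from the snippet, not a value confirmed by the site):

```python
# Pre-generate one paginated URL per results page, instead of following
# "next" links with a CrawlSpider Rule.
base = "https://www.leroymerlin.es/decoracion-navidena/arboles-navidad?index=%s"
urls = [base % page_number for page_number in range(1, 20)]

print(len(urls))  # 19
print(urls[0])    # ...arboles-navidad?index=1
```

The product links on each page are then matched by the single `allow=r'/fp'` rule, so only one Rule and one callback are needed.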