Scraping data by oncli

Posted 2024-05-15 20:20:56


I want to extract the title and PDF link of every paper listed at this URL: https://iclr.cc/Conferences/2019/Schedule?type=Poster

Here is my code:

class ICLRCrawler(Spider):
    name = "ICLRCrawler"
    allowed_domains = ["iclr.cc"]
    start_urls = ["https://iclr.cc/Conferences/2019/Schedule?type=Poster", ]

    def parse(self, response):
        papers = Selector(response).xpath('//*[@id="content"]/div/div[@class="paper"]')
        titles = Selector(response).xpath('//*[@id="maincard_704"]/div[3]')
        links = Selector(response).xpath('//*[@id="maincard_704"]/div[6]/a[2]')
        for title, link in zip(titles, links):
            item = PapercrawlerItem()
            item['title'] = title.xpath('text()').extract()[0]
            item['pdf'] = link.xpath('/@href').extract()[0]
            item['sup'] = ''
            yield item

However, getting the title and link of each paper does not seem to be easy. How should I change the code to get the data?


2 answers

User

Answer 1 · edited 2024-05-15 20:20:56

You can use a simpler approach:

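def parse(self, response):
    # Select every div whose id starts with "maincard_" (one per poster).
    for poster in response.xpath('//div[starts-with(@id, "maincard_")]'):
        item = PapercrawlerItem()
        # The leading "." keeps each lookup inside the current poster div.
        item["title"] = poster.xpath('.//div[@class="maincardBody"]/text()[1]').get()
        item["pdf"] = poster.xpath('.//a[@title="PDF"]/@href').get()
        yield item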
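
The per-card idea in this answer can be tried offline without Scrapy. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath (no `starts-with()`, so the id prefix is checked in Python instead). The `SAMPLE` markup, the `example.org` URL, and the helper name `extract_cards` are made up for illustration; the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the selectors used in the answer.
# The real page may be structured differently.
SAMPLE = """
<div id="content">
  <div id="maincard_1">
    <div class="maincardBody">First paper</div>
    <a title="PDF" href="https://example.org/first.pdf">PDF</a>
  </div>
  <div id="maincard_2">
    <div class="maincardBody">Second paper</div>
  </div>
  <div id="sidebar">not a card</div>
</div>
"""


def extract_cards(markup):
    """Return one dict per card, looking up title and link inside that card only."""
    root = ET.fromstring(markup.strip())
    items = []
    for div in root.iter("div"):
        # Keep only divs whose id starts with "maincard_".
        if not div.get("id", "").startswith("maincard_"):
            continue
        # Relative searches: each result belongs to the current card.
        body = div.find(".//div[@class='maincardBody']")
        link = div.find(".//a[@title='PDF']")
        has_title = body is not None and body.text
        items.append({
            "title": body.text.strip() if has_title else None,
            # A card without a PDF link yields None instead of raising.
            "pdf": link.get("href") if link is not None else None,
        })
    return items


if __name__ == "__main__":
    for item in extract_cards(SAMPLE):
        print(item)
```

Because the title and link are searched relative to each card, they stay paired even when a card has no PDF link, which is something zipping two separately selected lists cannot guarantee.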

User

Answer 2 · edited 2024-05-15 20:20:56

You have to replace extract()[0] with get_attribute('href').
