<p>To get absolute URLs from relative links, you can use Scrapy's <a href="https://doc.scrapy.org/en/latest/topics/request-response.html?#scrapy.http.Response.urljoin" rel="nofollow noreferrer">urljoin()</a> method and rewrite your code as follows:</p>
<pre><code>import scrapy


class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        # Follow each video link; urljoin() resolves relative hrefs against response.url
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)
        # If the page contains a link to the next page, extract it and parse that page too
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        # extract_first() returns None when the page has no iframe, so skip those pages
        if link:
            # The src is assumed to be protocol-relative (//www.youtube.com/...):
            # prepend the scheme and drop the query string
            yield {
                'you_tube_link': 'http:' + link.split('?')[0]
            }
# To save the links in CSV format, run in the console: scrapy crawl pdgavideos -o links.csv
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c
</code></pre>
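<p>To see what the spider's <code>response.urljoin()</code> calls are doing, here is a small standalone sketch using the standard library's <code>urllib.parse.urljoin()</code>, which Scrapy's method wraps with the response URL as the base. The helper name <code>absolutize</code>, the relative paths and the <code>?rel=0</code> query string are made up for illustration; only the base URL and the video ID come from the spider above.</p>

```python
from urllib.parse import urljoin

# Base URL taken from the spider's start_urls
BASE_URL = "http://www.pdga.com/videos/"


def absolutize(base_url, href):
    """Resolve href against base_url, much like response.urljoin(href) does
    with response.url as the base (hypothetical helper for illustration)."""
    return urljoin(base_url, href)


# Illustrative hrefs (assumed, not scraped from the real site)
hrefs = [
    "/videos/some-video",                          # root-relative path
    "?page=1",                                     # query-only link, e.g. pagination
    "//www.youtube.com/embed/tYBF-BaqVJ8?rel=0",   # protocol-relative URL
    "http://example.com/already/absolute",         # absolute URL is left unchanged
]

resolved = [absolutize(BASE_URL, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(f"{href!r:50} -> {url}")
```

<p>Note that a protocol-relative <code>src</code> picks up the scheme of the base URL, so <code>response.urljoin(link)</code> may be an alternative to prepending <code>'http:'</code> by hand, assuming the iframe sources really have that form.</p>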