擅长:python、mysql、java
<p>尝试<a href="https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.XMLFeedSpider.parse_node" rel="nofollow noreferrer">XMLFeedSpider</a></p>
<pre><code>from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def parse_node(self, response, node):
self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
item = TestItem()
item['id'] = node.xpath('@id').extract()
item['name'] = node.xpath('name').extract()
item['description'] = node.xpath('description').extract()
return item
</code></pre>
<p>或者使用Regex来提取所有url</p>
^{pr2}$