How do I parse a sitemap.xml file with Scrapy's XMLFeedSpider?

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
  <url>
    <loc> http://www.site.com/page.html </loc>
    <video:video>
      <video:thumbnail_loc> http://www.site.com/thumb.jpg </video:thumbnail_loc>
      <video:content_loc>http://www.example.com/video123.flv</video:content_loc>
      <video:player_loc allow_embed="yes" autoplay="ap=1"> http://www.example.com/videoplayer.swf?video=123 </video:player_loc>
      <video:title>here is the page title</video:title>
      <video:description>and an awesome description</video:description>
      <video:duration>302</video:duration>
      <video:publication_date>2011-02-24T02:03:43+02:00</video:publication_date>
      <video:tag>w00t</video:tag>
      <video:tag>awesome</video:tag>
      <video:tag>omgwtfbbq</video:tag>
      <video:tag>kthxby</video:tag>
    </video:video>
  </url>
</urlset>

class SitemapSpider(XMLFeedSpider):
    name = "sitemap"
    namespaces = [
        ('', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
        ('video', 'http://www.sitemaps.org/schemas/sitemap-video/1.1'),
    ]
    start_urls = ["http://example.com/sitemap.xml"]
    itertag = 'url'

    def parse_node(self, response, node):
        print "Parsing: %s" % str(node)

File "/.../python2.7/site-packages/scrapy/utils/iterators.py", line 32, in xmliter
    yield XmlXPathSelector(text=nodetext).select('//' + nodename)[0]
exceptions.IndexError: list index out of range

class SitemapSpider(BaseSpider):
    name = 'sitemap'
    namespaces = {
        'sitemap': 'http://www.sitemaps.org/schemas/sitemap/0.9',
        'video': 'http://www.sitemaps.org/schemas/sitemap-video/1.1',
    }

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        for namespace, schema in self.namespaces.iteritems():
            xxs.register_namespace(namespace, schema)
        for urlnode in xxs.select('//sitemap:url'):
            extract_datas_here()

2 Answers

User

Answer 1 · edited 2024-05-19 10:54:50

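I found it helpful to understand the difference between hxs and xxs. The xxs object was hard for me to track down. I was trying to use this: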

x = XmlXPathSelector(response)

when these worked much better for what I needed:

hxs.select('//p/text()').extract()

or

xxs.select('//title/text()').extract()

User

Answer 2 · edited 2024-05-19 10:54:50

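Scrapy uses lxml/libxml2 under the hood and eventually calls the node.xpath() method to perform the selection. Any namespaced element in your XPath expression must carry a prefix, and you must pass a mapping that tells the selector which namespace each prefix resolves to.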

Here is an example showing how to map prefixes to namespaces when using the node.xpath() method:

>>> import lxml.etree
>>> doc = '<root xmlns="chaos"><bar /></root>'
>>> tree = lxml.etree.fromstring(doc)
>>> tree.xpath('//bar')
[]
>>> tree.xpath('//x:bar', namespaces={'x': 'chaos'})
[<Element {chaos}bar at 7fa40f9c50a8>]
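The same prefix-to-namespace mapping can be tried against a trimmed copy of the sitemap from the question. The sketch below swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra packages; it is not what Scrapy does internally. The prefix `sm`, the helper `extract_entries`, and the shortened `SITEMAP` string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sitemap from the question (only a few video fields kept).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.sitemaps.org/schemas/sitemap-video/1.1">
  <url>
    <loc> http://www.site.com/page.html </loc>
    <video:video>
      <video:title>here is the page title</video:title>
      <video:tag>w00t</video:tag>
      <video:tag>awesome</video:tag>
    </video:video>
  </url>
</urlset>"""

# Prefix -> namespace URI. "sm" is an arbitrary prefix chosen for this sketch;
# it stands in for the document's default (unprefixed) namespace.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "video": "http://www.sitemaps.org/schemas/sitemap-video/1.1",
}


def extract_entries(xml_text):
    """Return one dict per <url> node, resolving every tag through NS."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            "title": url.findtext("video:video/video:title", default="", namespaces=NS),
            "tags": [tag.text for tag in url.findall("video:video/video:tag", NS)],
        })
    return entries


# An unprefixed query finds nothing, because <url> lives in the sitemap namespace.
unprefixed = ET.fromstring(SITEMAP.encode("utf-8")).findall("url")
entries = extract_entries(SITEMAP)
```

As in the lxml session, the unprefixed lookup comes back empty, while the prefixed lookups reach both the sitemap elements and the video elements.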

I haven't used the XMLFeedSpider class myself, but I'm guessing your namespace map and itertag need to follow the same scheme:

class SitemapSpider(XMLFeedSpider):
    namespaces = [
        ('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9'),
    ]
    itertag = 'sm:url'
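To make concrete what resolving a prefixed itertag against a list of (prefix, uri) pairs involves, here is a rough standard-library analogue. It is not Scrapy's implementation: `expand_tag`, `iter_nodes`, the `FEED` sample, and its second URL (`other.html`) are hypothetical names and data introduced only for this sketch.

```python
import io
import xml.etree.ElementTree as ET

# Same shape as the spider attributes above: (prefix, uri) pairs and a prefixed tag.
NAMESPACES = [
    ("sm", "http://www.sitemaps.org/schemas/sitemap/0.9"),
]
ITERTAG = "sm:url"


def expand_tag(tag, namespaces):
    """Turn 'prefix:name' into '{uri}name' using the (prefix, uri) pairs."""
    prefix, sep, name = tag.partition(":")
    if not sep:
        return tag  # no prefix given: leave the tag untouched
    uri = dict(namespaces)[prefix]  # raises KeyError for an undeclared prefix
    return "{%s}%s" % (uri, name)


def iter_nodes(xml_bytes, itertag, namespaces):
    """Yield each element matching itertag once it has been fully parsed."""
    wanted = expand_tag(itertag, namespaces)
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == wanted:
            yield elem


# Hypothetical two-entry feed; the second URL is made up for the example.
FEED = b"""<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.site.com/page.html</loc></url>
  <url><loc>http://www.site.com/other.html</loc></url>
</urlset>"""

locs = [
    node.findtext("sm:loc", namespaces=dict(NAMESPACES))
    for node in iter_nodes(FEED, ITERTAG, NAMESPACES)
]
```

With the prefixed tag the loop visits both `url` nodes; passing a bare `'url'` matches nothing, which mirrors the empty result from the unprefixed XPath query in the lxml session.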
