Scraping a sitemap with LinkEx

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

2 answers

User

Answer 1 · edited 2024-06-26 00:01:48

In this case, you can use bs4.

from bs4 import BeautifulSoup

XML = ''' <?xml version="1.0" encoding..... '''

# Pass a parser explicitly; 'html.parser' is built in and copes with these lowercase tags
soup = BeautifulSoup(XML, 'html.parser')
urlset_tag = soup.find_all('urlset')
# out: list with one element > [<urlset xmlns="http://www.si....]

link = urlset_tag[0].find_all('loc')
# out: [<loc>http://www.example.com/</loc>]

link_str = link[0].text
# out: 'http://www.example.com/'

If you have more than one urlset tag, loop over them, since the list will then contain more than one element.
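
The loop could look roughly like the sketch below. It is only an illustration: the `<root>` wrapper, the extra URLs, and the helper name `collect_links` are made up for the example and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical input: two <urlset> blocks inside a made-up <root> wrapper.
XML = """
<root>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.example.com/</loc></url>
  </urlset>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.example.com/about</loc></url>
    <url><loc>http://www.example.com/contact</loc></url>
  </urlset>
</root>
"""


def collect_links(xml_text):
    """Return the text of every <loc> found under any <urlset>."""
    soup = BeautifulSoup(xml_text, "html.parser")
    links = []
    # find_all returns a list, so this works for one urlset or many.
    for urlset_tag in soup.find_all("urlset"):
        for loc in urlset_tag.find_all("loc"):
            links.append(loc.get_text(strip=True))
    return links


links = collect_links(XML)
print(links)
```

The same loop also covers the single-urlset case, so indexing with `[0]` is no longer needed.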

User

Answer 2 · edited 2024-06-26 00:01:48

Try XMLFeedSpider:

from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item

Or use a regex to extract all the URLs.
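
A minimal sketch of the regex approach, assuming the URLs you want are the contents of the `<loc>` elements and that the sitemap text is already in a string. The pattern and the names `LOC_PATTERN` and `extract_urls` are illustrative choices, not the original answer's code.

```python
import re

# Sample sitemap, shortened from the one in the question.
XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
  </url>
</urlset>"""

# Non-greedy capture of whatever sits between <loc> and </loc>,
# ignoring surrounding whitespace. DOTALL lets a value span lines.
LOC_PATTERN = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.DOTALL)


def extract_urls(xml_text):
    """Return every <loc> value in the order it appears."""
    return LOC_PATTERN.findall(xml_text)


urls = extract_urls(XML)
print(urls)
```

This is fine for simple, well-formed sitemaps; a regex will not handle things like CDATA sections or namespace-prefixed tags, where a real XML parser is the safer choice.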
