<p><code>SitemapSpider</code> expects a sitemap in XML format. Your sitemap is plain text, so the spider ignores it and exits with the following warning:</p>
<p><code>[scrapy.spiders.sitemap] WARNING: Ignoring invalid sitemap: <200 https://legion-216909.appspot.com/sitemap.txt></code></p>
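<p>Since your <code>sitemap.txt</code> file is just a plain list of URLs, one per line, it is easier to fetch it with a regular <code>Spider</code> and split the response text with a string method.</p>
<p>For example:</p>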
<pre><code>from scrapy import Spider, Request


class MySpider(Spider):
    name = "spyder_PAGE"
    start_urls = ['https://legion-216909.appspot.com/sitemap.txt']

    def parse(self, response):
        # sitemap.txt is plain text with one URL per line
        links = response.text.splitlines()
        for link in links:
            # yield a request to get this link
            print(link)
            # https://legion-216909.appspot.com/index.html
            # https://legion-216909.appspot.com/content.htm
            # https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store
</code></pre>
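<p>The printed list includes a blank-line risk and a <code>.DS_Store</code> entry, so you may want to clean the links before requesting them. Below is a minimal sketch of that step using only the standard library. The helper name <code>extract_links</code>, its <code>skip_suffixes</code> parameter, and the choice to drop <code>.DS_Store</code> entries are assumptions made for illustration, not part of Scrapy.</p>

```python
from urllib.parse import urlparse


def extract_links(sitemap_text, skip_suffixes=(".DS_Store",)):
    """Return the usable URLs from a plain-text sitemap (one URL per line).

    Hypothetical helper: the name and the skip_suffixes filter are
    illustrative assumptions, not a Scrapy API.
    """
    links = []
    for line in sitemap_text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines, e.g. one left by a trailing newline
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue  # skip anything that is not an absolute http(s) URL
        if parsed.path.endswith(skip_suffixes):
            continue  # skip entries that are not pages (assumed filter)
        links.append(url)
    return links


# Sample text built from the URLs printed by the spider above
sample = (
    "https://legion-216909.appspot.com/index.html\n"
    "https://legion-216909.appspot.com/content.htm\n"
    "\n"
    "https://legion-216909.appspot.com/Dataset/module_4_literature/Unit_1/.DS_Store\n"
)
print(extract_links(sample))
```

<p>Inside <code>parse</code> you could then write <code>for url in extract_links(response.text): yield Request(url, callback=self.parse_page)</code>, where <code>parse_page</code> is a hypothetical callback you would define to handle each page.</p>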