<p>You just need to override <code>_parse_sitemap(self, response)</code> from <code>SitemapSpider</code> as follows:</p>
<pre><code>from scrapy import Request
from scrapy.spiders import SitemapSpider


class MySpider(SitemapSpider):
    sitemap_urls = [...]
    sitemap_rules = [...]

    def _parse_sitemap(self, response):
        # Yield a request for each URL in the txt file that matches your filters
        urls = response.text.splitlines()
        it = self.sitemap_filter(urls)
        for loc in it:
            for r, c in self._cbs:
                if r.search(loc):
                    yield Request(loc, callback=c)
                    break
</code></pre>
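<p>If it helps to see the matching logic on its own, here is a minimal sketch that runs without Scrapy. It assumes the rules behave like <code>self._cbs</code>, that is, a list of (compiled regex, callback) pairs where the first matching rule wins. The function name <code>match_urls</code>, the sample URLs and the callback names are hypothetical, and skipping blank lines is an extra step that the override above does not do.</p>

```python
import re


def match_urls(text, rules):
    """Pair each non-empty line of `text` with the callback name of the
    first rule whose pattern matches it (hypothetical helper)."""
    # Compile the patterns once, similar to how the rules end up as regex/callback pairs
    compiled = [(re.compile(pattern), name) for pattern, name in rules]
    matches = []
    for loc in text.splitlines():
        loc = loc.strip()
        if not loc:
            continue  # assumption: ignore blank lines in the txt sitemap
        for regex, name in compiled:
            if regex.search(loc):
                matches.append((loc, name))
                break  # first matching rule wins, as in the loop above
    return matches


# Hypothetical sample data standing in for the body of a plain-text sitemap
sample = (
    "https://example.com/product/1\n"
    "https://example.com/about\n"
    "\n"
    "https://example.com/category/shoes\n"
)
rules = [(r"/product/", "parse_product"), (r"/category/", "parse_category")]
result = match_urls(sample, rules)
```

<p>URLs that match no rule, such as the <code>/about</code> line in the sample, are simply dropped, which is also what happens in the spider.</p>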