I have the following script that recursively crawls a website:
#!/usr/bin/python
import scrapy
from scrapy.selector import Selector
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner

class GivenSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
        # "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        # "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    rules = (Rule(LinkExtractor(allow=r'/'), callback=parse, follow=True),)

    def parse(self, response):
        select = Selector(response)
        titles = select.xpath('//a[@class="listinglink"]/text()').extract()
        print ' [*] Start crawling at %s ' % response.url
        for title in titles:
            print '\t %s' % title

#configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(GivenSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
When I run it, I get:
^{pr2}$
Loïc Faure-Lacroix is right, but in the current version of Scrapy (1.6) you need to import Rule from scrapy.spiders, like this:

from scrapy.spiders import Rule

See the documentation for more information. If you look at the documentation and search for the word Rule, you will find:

http://doc.scrapy.org/en/0.20/topics/spiders.html?highlight=rule#crawling-rules

Since you didn't import anything, it is clear why Rule is not defined. So, in theory, you should be able to import the Rule class with:

from scrapy.contrib.spiders import Rule