Scrapy Yahoo Groups spider

    class YgroupSpider(CrawlSpider):
        name = "yahoo.com"
        allowed_domains = ["launch.groups.yahoo.com"]
        start_urls = [
            "http://launch.groups.yahoo.com/group/random_public_ygroup/post"
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow=('message', 'messages'), deny=('mygroups',))),
            Rule(SgmlLinkExtractor(), callback='parse_item'),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('/html')
            item = Item()
            for site in sites:
                item = YgroupItem()
                item['title'] = site.select('//title').extract()
                item['pubDate'] = site.select('//abbr[@class="updated"]/text()').extract()
                item['desc'] = site.select("//div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]/text()").extract()
            return item

1 answer

Anonymous user

#1 · Posted 2024-06-28 15:06:01

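It looks like you barely know what you are doing. I'm new to this myself, but I think you will want something like:

    Rule(SgmlLinkExtractor(allow=('http\://example\.com/message/.*\.aspx', )), callback='parse_item'),

Try writing a regular expression that matches the full URL of the links you want. Also, it looks like you only need one rule: add the callback to the first rule. The link extractor takes every link that matches a regular expression in allow, drops the ones that also match deny, and then each remaining page is loaded and passed to parse_item.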

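I am saying all of this without knowing the pages you are mining or the nature of the data you want. You need to point a spider like this at a page that links to the pages containing the data you want.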
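
The allow/deny filtering described in this answer can be sketched in plain Python. This is only an illustration of the matching logic, not Scrapy's own implementation: the function name filter_links and the example URLs are hypothetical, and the patterns are assumptions chosen to resemble the ones in the question.

```python
import re


def filter_links(urls, allow=(), deny=()):
    """Keep URLs that match any `allow` pattern and no `deny` pattern.

    Hypothetical helper that mimics the behaviour described in the answer.
    """
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    kept = []
    for url in urls:
        # An empty allow list is treated as "allow everything".
        allowed = not allow_res or any(r.search(url) for r in allow_res)
        denied = any(r.search(url) for r in deny_res)
        if allowed and not denied:
            kept.append(url)
    return kept


# Hypothetical links found on a group page.
links = [
    "http://example.com/message/1.aspx",
    "http://example.com/message/2.aspx",
    "http://example.com/mygroups/message/3.aspx",
    "http://example.com/about.html",
]

followed = filter_links(
    links,
    allow=(r"/message/.*\.aspx",),
    deny=(r"mygroups",),
)

for url in followed:
    print(url)
```

In a spider with a single rule, each URL that survives this filter would be the one that gets fetched and handed to the callback, which is why a precise allow pattern matters more than a second catch-all rule.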
