A lightweight web crawler that, given URLs and keywords, returns search results as HTML or as a dictionary.
crawlerfriend
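A lightweight web crawler for Python 2.7 that returns search results in HTML or dictionary form for a given list of URLs and keywords. If you regularly visit the same websites to look for the same keywords, this package automates that task and returns the results in an HTML file opened in your web browser.
Installation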
pip install CrawlerFriend
How to use?
All results in HTML format
import CrawlerFriend
urls = ["http://www.goal.com/","http://www.skysports.com/football","https://www.bbc.com/sport/football"]
keywords = ["Ronaldo","Liverpool","Salah","Real Madrid","Arsenal","Chelsea","Man United","Man City"]
crawler = CrawlerFriend.Crawler(urls, keywords)
crawler.crawl()
crawler.get_result_in_html()
The code above opens an HTML document containing the results in the browser.
All results as a dictionary
result_dict = crawler.get_result()
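The layout of the returned dictionary is not described here, so a quick way to explore it is to pretty-print it. The sketch below assumes only that the result is built from plain dicts, lists, and strings; `dump_result` and `sample_result` are hypothetical names, and the sample data is a stand-in, not real crawler output.

```python
import json


def dump_result(result):
    """Return an indented JSON rendering of a nested dict/list structure."""
    # default=str turns values JSON cannot encode into strings
    # instead of raising; ensure_ascii=False keeps non-ASCII text readable.
    return json.dumps(result, indent=2, ensure_ascii=False, default=str)


# Hypothetical stand-in data. In real use, replace it with:
#     result_dict = crawler.get_result()
sample_result = {"Liverpool": ["placeholder entry"], "Arsenal": []}
print(dump_result(sample_result))
```

If the real dictionary uses keys that JSON cannot represent (tuples, for instance), `pprint.pprint(result_dict)` from the standard library is a simpler alternative.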
Changing the default parameters
By default, CrawlerFriend searches four HTML tags ("title", "h1", "h2", "h3") and uses max_link_limit=50. These defaults can be changed by passing arguments to the constructor:
crawler = CrawlerFriend.Crawler(urls, keywords, max_link_limit=200, tags=['p','h4'])
crawler.crawl()
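To illustrate what restricting a search to certain tags means, here is a minimal sketch built on the Python 3 standard library. It is not CrawlerFriend's implementation: `TagKeywordFinder` and `sample_html` are hypothetical, the matching is a simple case-insensitive substring test, and nested tags of interest are ignored.

```python
from html.parser import HTMLParser


class TagKeywordFinder(HTMLParser):
    """Collect text from selected tags when it mentions any keyword."""

    def __init__(self, keywords, tags=("title", "h1", "h2", "h3")):
        HTMLParser.__init__(self)
        self.keywords = [k.lower() for k in keywords]
        self.tags = set(tags)
        self.current = None   # tag of interest currently open, if any
        self.buffer = []      # text pieces seen inside that tag
        self.matches = []     # (tag, text) pairs that contain a keyword

    def handle_starttag(self, tag, attrs):
        # Start collecting only when no tag of interest is already open.
        if tag in self.tags and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Collapse whitespace, then test each keyword as a substring.
            text = " ".join("".join(self.buffer).split())
            if any(k in text.lower() for k in self.keywords):
                self.matches.append((tag, text))
            self.current = None


# Hypothetical page used only for illustration.
sample_html = (
    "<html><head><title>Salah scores twice</title></head>"
    "<body><h1>Liverpool win</h1><p>Salah again</p>"
    "<h4>Arsenal draw</h4></body></html>"
)

default_finder = TagKeywordFinder(["Salah", "Liverpool", "Arsenal"])
default_finder.feed(sample_html)
default_finder.close()

custom_finder = TagKeywordFinder(["Salah", "Liverpool", "Arsenal"], tags=["p", "h4"])
custom_finder.feed(sample_html)
custom_finder.close()

print(default_finder.matches)
print(custom_finder.matches)
```

With the default tags, only the title and the h1 heading are inspected; passing `tags=["p", "h4"]` shifts the search to paragraphs and h4 headings, which mirrors the effect of the `tags` argument shown above.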