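Trulia Crawler Toolkit
Detailed description of the crawl_trulia Python project
Welcome to the crawl_trulia documentation
This is a small project that provides URL routing and HTML parsing tools for crawling www.trulia.com.
Usage
A real example: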
>>> from crawl_trulia.urlencoder import urlencoder
>>> from crawl_trulia.htmlparser import htmlparser
>>> from crawlib.spider import spider  # install crawlib first

>>> # use address, city and zipcode
>>> address = "22 Yew Rd"
>>> city = "Baltimore"
>>> zipcode = "21221"
>>> url = urlencoder.by_address_city_and_zipcode(address, city, zipcode)
>>> html = spider.get_html(url)
>>> house_detail_data = htmlparser.get_house_detail(html)
>>> house_detail_data
{
    "features": {},
    "public_records": {
        "AC": "a/c",
        "basement_type": "improved basement (finished)",
        "bathroom": 2,
        "build_year": 1986,
        "county": "baltimore county",
        "exterior_walls": "siding (alum/vinyl)",
        "heating": "heat pump",
        "lot_size": 7505,
        "lot_size_unit": "sqft",
        "partial_bathroom": 1,
        "roof": "composition shingle",
        "sqft": 998
    }
}

>>> # usually the combination of address and zipcode is enough
>>> address = "2004 Birch Rd"
>>> zipcode = "21221"
>>> url = urlencoder.by_address_and_zipcode(address, zipcode)
>>> html = spider.get_html(url)
>>> house_detail_data = htmlparser.get_house_detail(html)
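The dictionary returned by get_house_detail in the example above may not carry every field for every listing (note the empty "features" entry), so it can help to read it defensively. The following is a minimal sketch, not part of crawl_trulia: the helper name summarize_public_records is hypothetical, and the sample data is simply copied from the output shown above, so no network access is needed to run it.

```python
# Sample result copied from the example output above.
# Assumption: a real call may return more, fewer, or different keys.
house_detail_data = {
    "features": {},
    "public_records": {
        "AC": "a/c",
        "basement_type": "improved basement (finished)",
        "bathroom": 2,
        "build_year": 1986,
        "county": "baltimore county",
        "exterior_walls": "siding (alum/vinyl)",
        "heating": "heat pump",
        "lot_size": 7505,
        "lot_size_unit": "sqft",
        "partial_bathroom": 1,
        "roof": "composition shingle",
        "sqft": 998,
    },
}

# Fields to pull out of "public_records"; chosen for illustration only.
SUMMARY_FIELDS = ("build_year", "sqft", "lot_size", "lot_size_unit", "bathroom")


def summarize_public_records(detail):
    """Hypothetical helper: pick a few fields, using None when one is missing."""
    records = detail.get("public_records") or {}
    return {field: records.get(field) for field in SUMMARY_FIELDS}


summary = summarize_public_records(house_detail_data)
print(summary)
```

Using dict.get with a fallback means that a page with no public records yields a summary full of None values rather than raising KeyError.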
Installation
crawl_trulia is released on PyPI, so all you need is:
$ pip install crawl_trulia
To upgrade to the latest version:
$ pip install --upgrade crawl_trulia