<p>One way to do this is with Scrapy:</p>
<p>Create the project:</p>
<pre><code>scrapy startproject doctors && cd doctors
</code></pre>
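<p>Define the data to load (<code>items.py</code>). The item needs one field for each value the spider extracts: <code>doctor_name</code>, <code>qualification</code>, <code>membership</code>, <code>visiting_hospitals</code>, <code>phone</code>, <code>consulting_hours</code> and <code>specialist_in</code>.</p>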
<p>Create the spider. The generated <code>basic</code> template does not fit this task as-is, but it is a convenient starting point:</p>
<pre><code>scrapy genspider -t basic doctors_spider 'coimbatore.com'
</code></pre>
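<p>Change it so that, besides the items, <code>parse</code> yields a <code>Request</code> for every link it finds. The crawl then continues until each page containing doctors' information has been visited:</p>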
<pre><code>import scrapy

from doctors.items import DoctorsItem


class DoctorsSpiderSpider(scrapy.Spider):
    name = "doctors_spider"
    allowed_domains = ["coimbatore.com"]
    start_urls = [
        'http://www.coimbatore.com/doctors/home.htm'
    ]

    # Item fields, in the order of the table rows (tr[1] .. tr[7]) that hold them.
    item_fields = [
        'doctor_name', 'qualification', 'membership', 'visiting_hospitals',
        'phone', 'consulting_hours', 'specialist_in',
    ]

    def parse(self, response):
        # Each doctor is described by one table inside the first center element.
        for row in response.xpath('/html/body/center[1]/table[@cellpadding = 0]'):
            i = DoctorsItem()
            for n, field in enumerate(self.item_fields, start=1):
                texts = row.xpath('./tr[%d]/td[2]//font[@size = -1]/text()' % n).getall()
                # Join multiple text nodes with '|' and flatten newlines.
                i[field] = '|'.join(texts).replace('\n', ' ')
            yield i

        # Follow the links found in the third center element ...
        for url in response.xpath('/html/body/center[3]//a/@href').getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)

        # ... and every other link on the page. Scrapy's duplicate filter
        # drops URLs that were already requested.
        for url in response.xpath('/html/body//a/@href').getall():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
</code></pre>
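<p>To see what the spider's field expressions and link handling do without running a crawl, the two steps can be reproduced on plain Python values. This is only an illustrative sketch: <code>clean_field</code> and <code>absolute_links</code> are hypothetical helper names, the sample strings and hrefs are made up, and <code>urljoin</code> is imported from its Python 3 location:</p>

```python
from urllib.parse import urljoin


def clean_field(fragments):
    """Join extracted text nodes with '|' and replace newlines with spaces,
    mirroring what the spider does for every item field."""
    return "|".join(fragments).replace("\n", " ")


def absolute_links(page_url, hrefs):
    """Resolve (possibly relative) hrefs against the URL of the page."""
    return [urljoin(page_url, href) for href in hrefs]


if __name__ == "__main__":
    # Made-up text nodes: the newline inside the first one becomes a space.
    print(clean_field(["Kovai Medical Center,\n", "Coimbatore"]))
    # Made-up hrefs resolved against the start URL of the spider.
    print(absolute_links(
        "http://www.coimbatore.com/doctors/home.htm",
        ["cardiology.htm", "/index.htm"],
    ))
```

<p>Replacing the newline with a space rather than removing it is why some values in the exported file carry a trailing space before the <code>|</code> separator.</p>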
<p>Run it like this:</p>
<pre><code>scrapy crawl doctors_spider -o doctors.csv -t csv
</code></pre>
<p>This creates a <code>csv</code> file like:</p>
<pre><code>phone,membership,visiting_hospitals,qualification,specialist_in,consulting_hours,doctor_name
(H)00966 4 6222245|(R)00966 4 6230143 ,,Domat Al Jandal Hospital|Al Jouf |Kingdom Of Saudi Arabia ,"MBBS, MS, MCh ( Cardio-Thoracic)",Cardio Thoracic Surgery,,Dr. N. Rajaratnam
210075,FRCS(Edinburgh) FIACS,"SRI RAMAKRISHNA HOSPITAL|CHEST CLINIC,COWLEY BROWN ROAD,R.S.PURAM,CBE-2","MD.,DPPR.,FACP",PULMONOLOGY/ RESPIRATORY MEDICINE,"9-1, 5-8",DR.T.MOHAN KUMAR
+91-422-827784-827790,Member -IAPMR,"Kovai Medical Center & Hospital, Avanashi Road,|Coimbatore-641 014","M.B.B.S., Dip.in. Physical Medicine & Rehabilitation","Neck and Back pain, Joint pain, Amputee Rehabilitation,|Spinal cord Injuries & Stroke",9.00am to 5.00pm (Except Sundays),Dr.Edmund M.D'Couto
+91-422-303352,*********,"206, Puliakulam Road, Coimbatore-641 045","M.B.B.S., M.D., D.V.",Sexually Transonitted Diseases.,5.00pm - 7.00pm,Dr.M.Govindaswamy
...
</code></pre>
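<p>Because cells with several values are joined with <code>|</code>, anything that reads the file has to split them again. A minimal sketch using only the standard library; <code>split_values</code> and <code>load_doctors</code> are hypothetical names, and the sample uses just two of the columns shown above:</p>

```python
import csv
import io


def split_values(cell):
    """Split a '|'-joined cell into stripped, non-empty parts."""
    return [part.strip() for part in (cell or "").split("|") if part.strip()]


def load_doctors(stream):
    """Yield one dict per CSV row, with every column split into a list."""
    for row in csv.DictReader(stream):
        yield {column: split_values(value) for column, value in row.items()}


if __name__ == "__main__":
    # In practice the stream would be open("doctors.csv", newline="").
    sample = io.StringIO(
        "phone,doctor_name\n"
        "(H)00966 4 6222245|(R)00966 4 6230143 ,Dr. N. Rajaratnam\n"
    )
    for doctor in load_doctors(sample):
        print(doctor["doctor_name"], doctor["phone"])
```

<p>Stripping each part also removes the stray spaces left over from replaced newlines.</p>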