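<p><em>(I posted this to the <code>scrapy-users</code> mailing list, but at Paul's suggestion I'm posting it here as well, since it complements the answer with the <code>shell</code> command interaction.)</em></p>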
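<p>Generally, websites that use a third-party service to render some data visualization (a map, a table, etc.) have to send the data somehow, and in most cases that data is accessible from the browser.</p>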
<p>In this case, inspecting the requests made by the browser shows that the data is loaded by a POST request to <a href="https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php">https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php</a>.</p>
<p>So, basically, all the data you want is already there, in a nice JSON format.</p>
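<p>To make the shape of that request concrete, here is a minimal sketch in Python 3 using only the standard library. It builds the POST request without sending it. The form fields are the ones used in the shell session in this answer; the helper name <code>build_store_request</code> and the assumption that the endpoint expects a standard URL-encoded form body are mine.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php"


def build_store_request(url=URL):
    """Build (but do not send) the form POST for the store listing.

    Hypothetical helper: assumes a URL-encoded form body is what the
    endpoint expects.
    """
    payload = {
        "action": "ws_search_store_location",
        "store_name": "0",
        "store_area": "0",
        "store_type": "0",
    }
    body = urlencode(payload).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # urllib treats a request that carries data as a POST.
    return Request(url, data=body, headers=headers)


req = build_store_request()
print(req.get_method())
print(req.data.decode("ascii"))
```

<p>Passing <code>req</code> to <code>urllib.request.urlopen</code> would actually send it; whether the site still answers the same way today is not something this sketch can guarantee.</p>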
<p>Scrapy provides the <code>shell</code> command, which is very convenient for tinkering with the website before writing the spider:</p>
<pre><code>$ scrapy shell https://www.mcdonalds.com.sg/locate-us/
2013-09-27 00:44:14-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: scrapybot)
...
In [1]: from scrapy.http import FormRequest
In [2]: url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
In [3]: payload = {'action': 'ws_search_store_location', 'store_name':'0', 'store_area':'0', 'store_type':'0'}
In [4]: req = FormRequest(url, formdata=payload)
In [5]: fetch(req)
2013-09-27 00:45:13-0400 [default] DEBUG: Crawled (200) <POST https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php> (referer: None)
...
In [6]: import json
In [7]: data = json.loads(response.body)
In [8]: len(data['stores']['listing'])
Out[8]: 127
In [9]: data['stores']['listing'][0]
Out[9]:
{u'address': u'678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678',
u'city': u'Singapore',
u'id': 78,
u'lat': u'1.440409',
u'lon': u'103.801489',
u'name': u"McDonald's Admiralty",
u'op_hours': u'24 hours<br>\r\nDessert Kiosk: 0900-0100',
u'phone': u'68940513',
u'region': u'north',
u'type': [u'24hrs', u'dessert_kiosk'],
u'zip': u'731678'}
</code></pre>
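<p>In short: in your spider you have to return the <code>FormRequest(...)</code> above, then in the callback load the JSON object from <code>response.body</code>, and finally create an item with the values you want for each store in the <code>data['stores']['listing']</code> list.</p>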
<p>Something like this:</p>
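<pre><code>import json

from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider  # Scrapy 0.16; newer releases use scrapy.Spider

class McDonaldsItem(Item):  # assumed item definition, not shown in the original answer
    name = Field()
    address = Field()

class McDonaldSpider(BaseSpider):
    name = "mcdonalds"
    allowed_domains = ["mcdonalds.com.sg"]
    start_urls = ["https://www.mcdonalds.com.sg/locate-us/"]

    def parse(self, response):
        # This receives the response from the start URL, but we don't do anything with it.
        url = 'https://www.mcdonalds.com.sg/wp-admin/admin-ajax.php'
        payload = {'action': 'ws_search_store_location', 'store_name': '0', 'store_area': '0', 'store_type': '0'}
        return FormRequest(url, formdata=payload, callback=self.parse_stores)

    def parse_stores(self, response):
        # The AJAX endpoint returns JSON; each entry in the listing is one store.
        data = json.loads(response.body)
        for store in data['stores']['listing']:
            yield McDonaldsItem(name=store['name'], address=store['address'])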
</code></pre>
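<p>The raw values carry HTML line breaks and numbers stored as strings (see the sample record in the shell session), so you may want to tidy them before building items. This is a rough sketch in Python 3, not part of the spider above. The helper name <code>clean_store</code> and the choice to join address lines with commas are my own assumptions.</p>

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
_BR = re.compile(r"<br\s*/?>", re.IGNORECASE)


def clean_store(store):
    """Flatten one raw store record into plain values (hypothetical helper)."""
    parts = [part.strip() for part in _BR.split(store["address"])]
    return {
        "name": store["name"],
        # Drop empty fragments and join the remaining address lines.
        "address": ", ".join(part for part in parts if part),
        "lat": float(store["lat"]),
        "lon": float(store["lon"]),
    }


# Sample record copied from the shell session output.
sample = {
    "address": "678A Woodlands Avenue 6<br/>#01-05<br/>Singapore 731678",
    "lat": "1.440409",
    "lon": "103.801489",
    "name": "McDonald's Admiralty",
}

cleaned = clean_store(sample)
print(cleaned["address"])
```

<p>In the spider, <code>parse_stores</code> could call such a helper on each <code>store</code> before creating the item, provided the item declares the matching fields.</p>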