<p>First, the IMDb <a href="http://www.imdb.com/conditions" rel="nofollow">"Conditions of Use"</a> explicitly prohibit screen scraping:</p>
<blockquote>
<p>Robots and Screen Scraping: You may not use data mining, robots,
screen scraping, or similar data gathering and extraction tools on
this site, except with our express written consent as noted below.</p>
</blockquote>
<p>Try <em>exploring the IMDb JSON API</em> instead of the web-scraping approach.</p>
<hr/>
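<p>Your current problem is that the list of people born on a specific date is loaded through <em>a separate call to the <code>IMDb</code> API</em> and involves <em>javascript logic</em>.</p>
<p>The simplest option now is to switch to the <a href="http://selenium-python.readthedocs.org/" rel="nofollow"><code>selenium</code></a> browser automation tool. A working example using the <em>headless <code>PhantomJS</code> browser</em>:</p>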
<pre><code>from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")
# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))
# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    # swap the size parameters in the thumbnail URL for a fixed suffix
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text
    print(img, person, title)
</code></pre>
<p>For each poster, this prints the image URL, the detail text, and the title.</p>
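<p>The one non-obvious step in the loop is the image URL rewrite. As a minimal sketch, it can be pulled out into a small helper that runs without a browser. The function name <code>resize_poster_url</code> and the example URL are hypothetical, and the suffix is simply the one used in the code above; whether IMDb still serves images in this URL format is an assumption.</p>

```python
def resize_poster_url(src, size_suffix="_SX214_AL_"):
    """Replace everything after the '._V1.' marker with a fixed size suffix.

    Mirrors the expression used in the loop above:
        src.split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    If the marker is absent, the suffix is simply appended to the whole URL.
    """
    base = src.split("._V1.")[0]  # keep only the part before the marker
    return base + "._V1" + size_suffix + ".jpg"


if __name__ == "__main__":
    # Hypothetical thumbnail URL, used only for illustration
    print(resize_poster_url("https://example.com/images/M/abc123._V1._SY100_.jpg"))
```

<p>Keeping this as a plain function makes the string handling easy to check separately from the Selenium session.</p>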