Web scraping an IMDb page using BeautifulSoup

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'
test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))
# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)
    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

2 Answers

User

#1 · Edited 2024-06-26 00:19:55

First of all, the IMDb "Conditions of Use" explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

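Consider exploring the IMDb JSON API instead of the web-scraping approach.

Your current problem is that the list of people born on a specific date is loaded through a separate call to the IMDb API and involves JavaScript logic, so it is not part of the HTML that urllib2 downloads.

The easiest option now is to switch to the selenium browser automation tool. A working example using the headless PhantomJS browser: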

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Each iteration prints the image URL, the detail text, and the person's name, in that order.
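As far as I know, Selenium 4.x no longer ships the PhantomJS driver or the `find_element_by_*` helpers, so the example above may not run unmodified on a current install. Below is a rough Python 3 equivalent, assuming Selenium 4 and a local Chrome with a matching driver. The function names `poster_image_url` and `scrape_born_today` are hypothetical, and the selectors are copied from the example above, so they may no longer match the live page. The Conditions of Use quoted earlier still apply.

```python
def poster_image_url(src, width=214):
    """Rewrite an image URL to a fixed-width variant, mirroring the
    split('._V1.') trick used in the example above."""
    base = src.split("._V1.")[0]
    return "{}._V1_SX{}_AL_.jpg".format(base, width)


def scrape_born_today(url="http://m.imdb.com/feature/bornondate", timeout=10):
    """Return a list of dicts (image, detail, title), one per poster.

    Assumption: the page still uses the section.posters / a.poster markup
    that the original answer relied on.
    """
    # Imported lazily so poster_image_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the JavaScript-loaded poster section is in the DOM.
        posters = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters"))
        )
        people = []
        for a in posters.find_elements(By.CSS_SELECTOR, "a.poster"):
            src = a.find_element(By.TAG_NAME, "img").get_attribute("src")
            people.append({
                "image": poster_image_url(src),
                "detail": a.find_element(By.CSS_SELECTOR, "div.detail").text,
                "title": a.find_element(By.CSS_SELECTOR, "span.title").text,
            })
        return people
    finally:
        # Always shut the browser down, even if the wait times out.
        driver.quit()
```

Calling `scrape_born_today()` starts the browser and returns the collected entries. `poster_image_url` can be used on its own, for example on a `src` value you already have.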

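User

#2 · Edited 2024-06-26 00:19:55

I was working on the same task. The urllib library loads only the static content of a web URL. Use selenium to get the complete HTML, which also includes the dynamic content. If you use the urllib2 library, the resulting HTML will contain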

<span class="loading"></span>
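One way to confirm that a fetched page is still in this unrendered state is to look for that placeholder. This is a minimal sketch in Python 3 using only the standard library. The class name, the function name, and both sample HTML strings are made up for illustration and are not taken from the actual IMDb response.

```python
from html.parser import HTMLParser


class LoadingPlaceholderFinder(HTMLParser):
    """Counts <span class="loading"> placeholders and <a> tags in HTML."""

    def __init__(self):
        super().__init__()
        self.loading_spans = 0
        self.links = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "loading" in classes:
            self.loading_spans += 1
        elif tag == "a":
            self.links += 1


def looks_unrendered(html):
    """Return True if the HTML has a loading placeholder and no <a> entries."""
    finder = LoadingPlaceholderFinder()
    finder.feed(html)
    return finder.loading_spans > 0 and finder.links == 0


# Hypothetical static response: only the placeholder, no posters yet.
static_html = '<section class="posters list"><span class="loading"></span></section>'
# Hypothetical rendered markup: the placeholder has been replaced by entries.
rendered_html = (
    '<section class="posters list">'
    '<a class="poster" href="#"><img src="x.jpg">'
    '<div class="label"><span class="title">Some Name</span></div></a>'
    '</section>'
)

print(looks_unrendered(static_html))    # True
print(looks_unrendered(rendered_html))  # False
```

If the check returns True for the HTML you downloaded, the entries are being filled in by JavaScript and a browser-driven tool such as selenium is needed to see them.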

Hope that helps.
