Web scraping an IMDb page using BeautifulSoup

Published 2024-06-26 00:19:55


I'm new to web scraping, Python, and BeautifulSoup, and I'm having trouble getting my code to work.

From the URL http://m.imdb.com/feature/bornondate I want to get:

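  • the celebrity's name
  • the celebrity's image
  • their profession
  • their best work

for each of the ten celebrities on that page. I don't know what I'm doing wrong.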

Here is my code:

import urllib2
from bs4 import BeautifulSoup

url = 'http://m.imdb.com/feature/bornondate'

test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()

soup = BeautifulSoup(readHtml)
# Using it track the number of Actor
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))

# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)

    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork

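Currently nothing is printed. Also, if this is too vague, could you explain how to get just the celebrity's name, for example?

Here is the HTML for the first celebrity, in case it helps: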

(HTML snippet not preserved in the source)

2 Answers

First of all, the IMDb "Conditions of Use" explicitly prohibit screen scraping:

Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

Try exploring the IMDb JSON API instead of the web-scraping approach.


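Your immediate problem is that the list of people born on a given date is not part of the initial HTML: it is loaded through a separate call to the IMDb API and involves JavaScript logic.

The easiest option right now is to switch to the selenium browser automation tool. A working example using the headless PhantomJS browser: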

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")

# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))

# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'

    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text

    print img, person, title

Prints:

(output not preserved in the source)
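Once the page has been rendered, the BeautifulSoup approach from the question can still be used on the rendered HTML (for example, Selenium's driver.page_source). The following is a minimal sketch, written for Python 3, that pulls the name, image, profession, and best work out of each poster. The sample markup, the example.com URLs, the names in it, and the parse_people helper are all hypothetical: the markup is reconstructed from the selectors used in the code above, and the "profession, best work" layout of the detail text is an assumption, so the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup, reconstructed from the selectors used above
# (section.posters, a.poster, img, span.title, div.detail). Not real IMDb data.
SAMPLE_HTML = """
<section class="posters list">
  <a class="poster" href="/name/nm0000001/">
    <img src="https://example.com/images/person1._V1._SX100_.jpg">
    <div class="label">
      <span class="title">Jane Example</span>
      <div class="detail">Actress, Example Movie</div>
    </div>
  </a>
  <a class="poster" href="/name/nm0000002/">
    <img src="https://example.com/images/person2._V1._SX100_.jpg">
    <div class="label">
      <span class="title">John Sample</span>
      <div class="detail">Director, Sample Film</div>
    </div>
  </a>
</section>
"""


def parse_people(html):
    """Return one dict per complete poster link found in the rendered HTML."""
    soup = BeautifulSoup(html, "html.parser")
    people = []
    for a in soup.find_all("a", class_="poster"):
        title = a.find("span", class_="title")
        detail = a.find("div", class_="detail")
        img = a.find("img")
        if title is None or detail is None or img is None:
            continue  # skip posters that are incomplete or still loading
        # Assumption: the detail text looks like "profession, best work".
        profession, _, best_work = detail.get_text(strip=True).partition(",")
        people.append({
            "name": title.get_text(strip=True),
            # Same image URL rewrite as in the question's code.
            "image": img.get("src", "").split("._V1.")[0] + "._V1_SX214_AL_.jpg",
            "profession": profession.strip(),
            "best_work": best_work.strip(),
        })
    return people


for number, person in enumerate(parse_people(SAMPLE_HTML), start=1):
    print(number, person["name"], person["profession"],
          person["best_work"], person["image"])
```

To get only the names, as asked in the question, the same loop can be reduced to collecting title.get_text(strip=True) for each poster.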

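I was working on the same task. The urllib library loads only the static content of a web URL. Use selenium to get the complete HTML, which also includes the dynamic content. If you use the urllib2 library, the resulting HTML will contain only a loading placeholder: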

<span class="loading"></span>
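A quick way to confirm that a fetched page is still in this unrendered state is to look for that placeholder before trying to parse any posters. The sketch below is written for Python 3 and uses only the standard library; the has_loading_placeholder helper and both sample strings are hypothetical and stand in for HTML you would have fetched yourself.

```python
from html.parser import HTMLParser


class LoadingPlaceholderFinder(HTMLParser):
    """Records whether any <span> with the class "loading" appears in the HTML."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "loading" in classes:
            self.found = True


def has_loading_placeholder(html):
    """Return True if the HTML still contains the loading placeholder."""
    finder = LoadingPlaceholderFinder()
    finder.feed(html)
    return finder.found


# Hypothetical samples: what a static fetch might return versus rendered HTML.
static_html = '<section class="posters list"><span class="loading"></span></section>'
rendered_html = ('<section class="posters list"><a class="poster">'
                 '<span class="title">Jane Example</span></a></section>')

print(has_loading_placeholder(static_html))    # True
print(has_loading_placeholder(rendered_html))  # False
```

If the check returns True for the HTML you fetched, the posters have not been loaded yet and no selector will find them.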

Hope that helps.
