Web Scraper Not Producing Results in Python

Posted 2024-07-07 07:36:37



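from urllib.request import urlopen
import re

# Download the page source as text.
url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read().decode('utf-8', errors='replace')

# A link on the page that the pattern should match:
# <a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a>

# Placeholder pattern: the capture group is the part still to be worked out.
A = '<a href.*pdf">(expression to pull everything)</a>'
B = re.compile(A)
C = re.findall(B, url)

print(C)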




Beautiful Soup parses whatever HTML you give it and handles the tree traversal for you. You can tell it "Find all the links", "Find all the links of class externalLink", "Find all the links whose URLs match foo.com", or "Find the table heading that's got bold text, then give me that text."

>>> from bs4 import BeautifulSoup
>>> html = "..."  # insert your raw HTML here
>>> soup = BeautifulSoup(html, "html.parser")
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print(anchor.contents)
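
To make the "find all the links whose URLs match ..." idea concrete for the PDF links in the question, here is a rough sketch. The `html` string is a made-up stand-in for the downloaded page (only the PDF link is taken from the question), and the `externalLink` class and foo.com address are just illustrations, so treat this as one possible approach rather than the definitive way to do it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = """
<table>
  <tr><td><a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a></td></tr>
  <tr><td><a class="externalLink" href="http://foo.com/page.html">Other</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the anchors whose href ends in ".pdf".
pdf_links = soup.find_all("a", href=re.compile(r"\.pdf$"))
pdf_titles = [a.get_text(strip=True) for a in pdf_links]
pdf_urls = [a["href"] for a in pdf_links]

# Anchors carrying a given CSS class.
external = soup.find_all("a", class_="externalLink")

print(pdf_titles)
print(pdf_urls)
```

Passing a compiled pattern as the `href` filter lets Beautiful Soup do the matching per tag, so the markup around the link never has to be described in the regular expression.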


re.findall(r'href.*?pdf">(.+?)</a>', url)
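
A small sketch of why the non-greedy `.*?` matters in that pattern. The `sample_html` string is a hypothetical stand-in for the page source; the second link (example.com, "Second poll") is invented purely for illustration.

```python
import re

# Hypothetical sample: two PDF links on a single line of markup.
sample_html = (
    '<td><a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a></td>'
    '<td><a href="http://example.com/other.pdf">Second poll</a></td>'
)

# Non-greedy: each match stops at the nearest 'pdf">' and the nearest '</a>'.
lazy = re.findall(r'href.*?pdf">(.+?)</a>', sample_html)

# Greedy: '.*' runs on to the last 'pdf">' on the line, swallowing the first link.
greedy = re.findall(r'href.*pdf">(.+?)</a>', sample_html)

print(lazy)    # ['Title', 'Second poll']
print(greedy)  # ['Second poll']
```

The regular expression only works while every link keeps this exact `pdf">` shape; if the page adds attributes after the href or breaks a tag across lines, a parser such as Beautiful Soup is likely to hold up better.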
