Web scraper not producing results in Python

from urllib import urlopen
import re

url = urlopen('http://www.realclearpolitics.com/epolls/2012/senate/ma/massachusetts_senate_brown_vs_warren-2093.html#polls').read()

'''
a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title a>
'''

A = 'a href.*pdf">(expression to pull everything) a>'
B = re.compile(A)
C = re.findall(B,url)
print C

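2 answers

User

Answer 1 · edited 2024-07-07 07:36:37

This comes up here a lot. Rather than using regular expressions, you are better off using an HTML parser that lets you search and traverse the document tree.

I would use BeautifulSoup: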

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text."

>>> from bs4 import BeautifulSoup
>>> html = "..."  # insert your raw HTML here
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> a_tags = soup.find_all("a")
>>> for anchor in a_tags:
...     print(anchor.contents)
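
If installing Beautiful Soup is not an option, the same "walk the tree instead of pattern-matching" idea can be sketched with the standard library's `html.parser`. This is a stand-in for the approach above, not the Beautiful Soup API. The class name `PdfLinkParser`, the helper `pdf_links`, and the sample markup are made up for illustration, and the sample assumes the link in the question was a normal `<a href="...">Title</a>` element.

```python
from html.parser import HTMLParser


class PdfLinkParser(HTMLParser):
    """Collects (href, link text) pairs for anchors whose href ends in .pdf."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the PDF anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples; value may be None
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def pdf_links(html):
    """Return (href, text) for every PDF link found in the given HTML string."""
    parser = PdfLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup modeled on the link quoted in the question.
sample = (
    '<p><a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a> '
    '<a href="/about.html">About</a></p>'
)
print(pdf_links(sample))
```

On a real page you would pass the decoded response body to `pdf_links` instead of `sample`.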

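User

Answer 2 · edited 2024-07-07 07:36:37

I'll echo the other comment about not using regular expressions to parse HTML, but sometimes a regex is quick and easy. The HTML in your example doesn't look quite right, but I would try something like this: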

re.findall(r'href.*?pdf">(.+?)</a>', url)  # url holds the page HTML read in the question
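
For a quick check of the regex route, here is a small self-contained sketch you can run without fetching the page. It tightens the pattern above slightly and assumes `href` is the first attribute and is double-quoted. The names `PDF_LINK` and `pdf_link_titles` and the sample markup are hypothetical, and the sample assumes the link was a well-formed `<a ...>Title</a>` element.

```python
import re

# Non-greedy pattern: the href value must end in .pdf, and the group
# captures whatever text sits between the opening tag and </a>.
PDF_LINK = re.compile(
    r'<a\s+href="[^"]*?\.pdf"[^>]*>(.+?)</a>',
    re.IGNORECASE | re.DOTALL,
)


def pdf_link_titles(html):
    """Return the link text of every anchor that points at a .pdf file."""
    return PDF_LINK.findall(html)


# Hypothetical sample markup modeled on the link quoted in the question.
sample = (
    '<a href="http://multimedia.heraldinteractive.com/misc/umlrvnov2012final.pdf">Title</a> '
    '<a href="/about.html">About</a>'
)
print(pdf_link_titles(sample))
```

This kind of pattern breaks as soon as the markup changes (single quotes, reordered attributes, nested tags), which is the usual argument for a real parser.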
