Raw HTML vs. DOM scraping in Python using mechanize and Beautiful Soup

import mechanize
from bs4 import BeautifulSoup

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

# Fetch the raw HTML exactly as the server returns it (no JavaScript is run).
br = mechanize.Browser()
data = br.open(webpage).get_data()

# Parse the raw HTML and print the resulting tree.
soup = BeautifulSoup(data, 'html.parser')
print(soup)

2 Answers

User

Answer 1 · edited 2024-10-01 05:05:05

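Mechanize and Beautiful Soup are excellent web-scraping tools in Python.

But you need to understand what each one is for:

Mechanize: it mimics what a browser does on a web page (opening URLs, following links, submitting forms).

BeautifulSoup: an HTML parser that works well even when the HTML is badly formed.

Your problem seems to be JavaScript. The prices are populated by an AJAX call made from JavaScript. However, mechanize does not execute JavaScript, so any content produced by JavaScript is invisible to mechanize.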
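To see why the prices never show up in the fetched page, here is a minimal sketch. It uses only Python's standard-library `html.parser` as a stand-in for mechanize plus Beautiful Soup, an assumption made so the example runs without third-party packages. The HTML snippet, the `price` id, and the `/api/prices` endpoint are all made up for illustration and are not Kayak's real markup.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty placeholder plus a script
# that would fill it in later. None of this is Kayak's actual markup.
RAW_HTML = """
<html><body>
  <div id="price"></div>
  <script>
    fetch("/api/prices").then(r => r.text()).then(t => {
      document.getElementById("price").textContent = t;
    });
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collect all text inside the element that has the given id.

    Sketch only: void elements such as <br> inside the target are not handled.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while the parser is inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_of(html, element_id):
    """Return the stripped text content of the element with this id."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# A static parser only sees what the server sent. The placeholder is empty
# because the script was never executed.
print(repr(text_of(RAW_HTML, "price")))  # prints ''
```

The script is present in the raw HTML, but only as inert text. Something has to run it before the price exists in the DOM.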

Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master

It is a wrapper around mechanize and Beautiful Soup with JS execution.

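User

Answer 2 · edited 2024-10-01 05:05:05

Answering my own question, because I have learned a lot in the years since I asked it. Today I would use Selenium WebDriver for this job. Selenium is exactly the tool I was looking for in 2012 for this kind of web scraping project.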

https://www.seleniumhq.org/download/

http://chromedriver.chromium.org/
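A rough sketch of what that could look like, assuming the `selenium` package and a Chrome/ChromeDriver install are available. The helper names `make_chrome_driver` and `rendered_html` are made up for this example. The polling loop is a simple stand-in for Selenium's own `WebDriverWait`, used here so the waiting logic can be read and exercised without a browser.

```python
import time


def make_chrome_driver():
    """Start headless Chrome. Needs the selenium package and Chrome installed."""
    from selenium import webdriver  # third-party; imported lazily on purpose

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url, css_selector, timeout=20.0, poll=0.5):
    """Load url, wait until css_selector matches, then return the live DOM as HTML."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value behind selenium's By.CSS_SELECTOR.
        if driver.find_elements("css selector", css_selector):
            # page_source reflects the DOM after JavaScript has run.
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{css_selector!r} did not appear within {timeout}s"
            )
        time.sleep(poll)
```

Typical use would be `driver = make_chrome_driver()`, then `html = rendered_html(driver, webpage, ".price")`, and finally `driver.quit()` in a `finally` block. The `.price` selector is a placeholder; inspect the live page to find the element that actually holds the fares. The returned HTML can then be handed to Beautiful Soup as before.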
