BeautifulSoup won't parse a specific site



I'm trying to parse this site, but for reasons I can't understand, nothing happens:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
response = urllib2.urlopen(url).read()
doc = BeautifulSoup(response)
divs = doc.findAll('div')
print len(divs)  # prints 0.

The site lists real estate ads for Rio de Janeiro, Brazil. I can't find anything in the HTML source that would stop BeautifulSoup from working. Could it be the page size?

I'm using Enthought Canopy Python 2.7.6, IPython Notebook 2.0, and BeautifulSoup 4.3.2.


2 Answers

It's an issue with your environment; here's the output I get:

>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> doc = BeautifulSoup(response)
>>> divs = doc.findAll('div')
>>> print len(divs) # prints 0.
558

This happens because you're letting BeautifulSoup pick the most suitable parser for you, and which one it picks depends on what modules are installed in your Python environment.
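As a quick way to see which parser was auto-selected, the soup object keeps a reference to the tree builder it chose. Note that builder and its NAME attribute are bs4 internals rather than documented API, so treat this as a diagnostic sketch; in an environment where html5lib wins the auto-selection (which would explain the asker's 0), you would see:

>>> doc = BeautifulSoup(response)
>>> doc.builder.NAME
'html5lib'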

According to the documentation:

The first argument to the BeautifulSoup constructor is a string or an open filehandle–the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks lxml’s parser as being the best, then html5lib’s, then Python’s built-in parser.

So, different parsers give different results:

>>> from bs4 import BeautifulSoup
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> len(BeautifulSoup(response, 'lxml').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html.parser').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html5lib').find_all('div'))
0

The solution is to explicitly specify the parser to use for this particular page; you may need to install lxml or html5lib first.
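Applied to the original snippet, the fix is a one-argument change; the sketch below assumes you go with lxml (installed via pip install lxml), since that parser returned 558 divs in the comparison above:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> response = urllib2.urlopen(url).read()
>>> doc = BeautifulSoup(response, 'lxml')  # explicit parser choice
>>> print len(doc.find_all('div'))
558

Pinning the parser name this way makes the result reproducible, independent of whichever parsers happen to be installed in a given environment.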

See also: Differences between parsers.
