Using BeautifulSoup and

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url="http://www.jbhifionline.com.au/support.aspx?post=1&results=10&source=all&bnSearch=Go!&q=ipod&submit=Go"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
Item0=soup.findAll('td',{'class':'check_title'})[0]
print (Item0.renderContents())

1 answer

User

#1 · Posted on 2024-09-30 18:26:57

Don't use .renderContents(); it is a debugging tool at best.

Just take the first child:

>>> Item0.contents[0]
u'Apple iPod Classic 160GB (Black)\xc2\xa0\r\n\t\t\t\t\t\t\t\t\t\t\t'
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)\xc2'

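BeautifulSoup does not seem to have guessed the encoding correctly, so the non-breaking space (U+00A0), which UTF-8 encodes as the two bytes 0xC2 0xA0, shows up as two separate characters instead of one. Stripping whitespace removes the \xa0 but leaves the stray \xc2 behind. The detected encoding confirms the wrong guess: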

>>> soup.originalEncoding
'iso-8859-1'
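The effect can be reproduced without BeautifulSoup or a network connection. The following is a minimal sketch in Python 3 (an assumption; the code in this thread is Python 2), using a shortened copy of the title string from the output above. The variable names are illustrative only.

```python
# Python 3 sketch: decode the same UTF-8 bytes with the wrong and the right codec.
title = "Apple iPod Classic 160GB (Black)\u00a0\r\n\t\t"

raw = title.encode("utf-8")        # bytes as a UTF-8 server would send them
wrong = raw.decode("iso-8859-1")   # what a Latin-1 guess produces
right = raw.decode("utf-8")        # what the declared charset produces

# Under Latin-1 the non-breaking space becomes two characters: \xc2 and \xa0.
# strip() removes the \xa0 (it counts as whitespace) but not the \xc2.
print(repr(wrong.strip()))
print(repr(right.strip()))
```

With the wrong codec the stripped string still ends in the character U+00C2; with UTF-8 the trailing whitespace disappears entirely.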

You can force the encoding using the response headers; this server does set a charset:

>>> page.info().getparam('charset')
'utf-8'
>>> page=urllib2.urlopen(url)
>>> soup = BeautifulSoup(page.read(), fromEncoding=page.info().getparam('charset'))
>>> Item0=soup.findAll('td',{'class':'check_title'})[0]
>>> Item0.contents[0].strip()
u'Apple iPod Classic 160GB (Black)'

The fromEncoding parameter tells BeautifulSoup to use UTF-8 instead of Latin-1, and the non-breaking space is now stripped correctly.
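For readers on Python 3, where urllib2 and getparam() are not available, the same idea (read the charset from the Content-Type header, then decode with it) can be sketched with the standard library alone. This is a hedged illustration, not the answer's code: the helper names charset_from_content_type and decode_body are hypothetical, the header value is a made-up sample, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset declared in a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw, content_type):
    """Decode response bytes using the charset the server declared."""
    return raw.decode(charset_from_content_type(content_type), errors="replace")


# Sample bytes standing in for page.read(); no network access is needed.
body = "Apple iPod Classic 160GB (Black)\u00a0\r\n".encode("utf-8")
text = decode_body(body, "text/html; charset=utf-8")
print(repr(text.strip()))
```

On a real Python 3 response from urllib.request.urlopen, the headers object is itself an email.message.Message subclass, so response.headers.get_content_charset() gives the declared charset directly.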
