Python BeautifulSoup HTML parsing handles GBK encoding poorly: garbled text when scraping a Chinese web page

# -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]

1 answer

User

#1 · Posted 2024-09-27 01:23:55

The page's meta tag declares the charset as GB2312, but the data contains a character from the newer GBK/GB18030 sets, and that is what makes BeautifulSoup choke:

simon@lucifer:~$ python
Python 2.7 (r27:82508, Jul  3 2010, 21:12:11) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> data = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
>>> data.decode("gb2312")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 20148-20149: illegal multibyte sequence
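
The same failure can be reproduced offline. The following is a minimal sketch, assuming Python 3 (the session above is Python 2.7) and using the ten-character excerpt from the page as a stand-in for the downloaded bytes; `first_bad_offset` is a hypothetical helper, not part of any library.

```python
# Python 3 sketch; the sample string is the excerpt from the page quoted in this answer.
sample = "毒尾气二噁英的排放难"
data = sample.encode("gb18030")  # stand-in for the bytes returned by urlopen().read()


def first_bad_offset(raw, encoding):
    """Return the byte offset where strict decoding fails, or None if it succeeds."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


gb2312_offset = first_bad_offset(data, "gb2312")
gb18030_offset = first_bad_offset(data, "gb18030")
print("gb2312 fails at byte:", gb2312_offset)
print("gb18030 fails at byte:", gb18030_offset)
```

The first four characters take two bytes each, so the GB2312 decoder stops at the fifth character, the one that exists only in GBK/GB18030.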

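At this point UnicodeDammit gives up on the declared encoding and tries chardet, then UTF-8, and finally Windows-1252, which always succeeds. By the looks of it, that is what you ended up with.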
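
To illustrate why a last-resort Windows-1252 decode yields garbage rather than an error, here is a rough sketch of a fallback chain. It assumes Python 3, leaves out the chardet step, and is not UnicodeDammit's actual implementation; `decode_with_fallbacks` is a hypothetical name.

```python
def decode_with_fallbacks(raw, candidates):
    """Try each candidate encoding strictly; fall back to Windows-1252 last."""
    for encoding in candidates:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: errors="replace" means this step cannot raise,
    # but multi-byte Chinese text comes out as Western mojibake.
    return "windows-1252", raw.decode("windows-1252", errors="replace")


original = "毒尾气二噁英的排放难"
raw = original.encode("gb18030")

# The declared (wrong) charset and UTF-8 both fail, so the fallback wins.
used, text = decode_with_fallbacks(raw, ["gb2312", "utf-8"])
# With the right codec first, the text survives intact.
good_used, good_text = decode_with_fallbacks(raw, ["gb18030"])

print(used, repr(text))
print(good_used, good_text)
```

The point of the sketch is only that a decoder which never fails will happily return the wrong text, so the fix is to supply the right encoding up front.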

If we tell the decoder to replace unrecognised characters with a placeholder, we can see which character is missing from GB2312:

>>> print data[20140:20160].decode("gb2312", "replace")
毒尾气二�英的排放难

Using the correct encoding:

>>> print data[20140:20160].decode("gb18030", "replace")
毒尾气二噁英的排放难
>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup(data, fromEncoding="gb18030")
>>> print s.findAll("p")[2].string[:10]
　　信息通信技术是&

Also:

>>> print s.findAll("p")[2].string
　　信息通信技术是&ldquo;十二五&rdquo;规划重点发展方向，行业具有很强的内在增长潜
力，增速远高于GDP。软件外包、服务外包、管理软件、车载导航、网上购物、网络游戏、
移动办公、移动网络游戏、网络视频等均存在很强的潜在需求，使信息技术行业继续保持较
高增长。
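
If you would rather not hard-code the encoding for every page, one option is to read the declared charset and widen it to its superset before decoding. The following is a minimal sketch, assuming Python 3 and the standard library only; `SUPERSETS`, `sniff_declared_charset` and `decode_html` are hypothetical names, and the regex is a crude stand-in for real meta-tag parsing.

```python
import re

# Hypothetical mapping: declared label -> superset codec to decode with.
# GB18030 can decode everything that is valid GB2312 or GBK.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}

# Crude sniffing of charset=... near the top of the document.
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE)


def sniff_declared_charset(data, default="utf-8"):
    """Return the lower-cased charset label declared in the first 4 KB, if any."""
    match = META_CHARSET.search(data[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(data):
    """Decode with the superset of the declared charset; unknown labels are used as-is."""
    declared = sniff_declared_charset(data)
    codec = SUPERSETS.get(declared, declared)
    return declared, codec, data.decode(codec, "replace")


# Stand-in page: declares gb2312 but contains a GBK/GB18030-only character.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312"></head>'
        '<body><p>毒尾气二噁英的排放难</p></body></html>').encode("gb18030")

declared, codec, text = decode_html(page)
print(declared, "->", codec)
print(text)
```

The resulting codec name could then be passed as `fromEncoding`, as in the session above. A label that Python does not recognise would raise `LookupError` here; the sketch does not handle that.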
