How do I correctly parse UTF-8 encoded HTML into Unicode strings with BeautifulSoup?

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)

# Parse with BeautifulSoup
soup = BeautifulSoup(response)

# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])

2 Answers

User

#1 · edited 2024-09-30 20:34:48

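As justhalf pointed out above, my question is essentially a duplicate of this question.

The HTML content reports itself as UTF-8 encoded and, for the most part, it is, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I tried to decode the content as UTF-8 first, before passing it to BeautifulSoup, like this: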

soup = BeautifulSoup(response.read().decode('utf-8'))

I got the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

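Looking more closely at the output, there was one instance of the character Ü that had been wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.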
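The failure can be reproduced without touching the network. The following is a minimal sketch, assuming Python 3 `bytes` semantics (the snippets in this thread are Python 2) and a made-up sample word rather than the page's real content; the names `valid`, `broken`, and `decode_error` are illustrative only:

```python
# U-umlaut is U+00DC; its correct UTF-8 encoding is the two bytes 0xC3 0x9C.
valid = b"\xc3\x9cber"
decoded = valid.decode("utf-8")  # the German word "Über"

# 0xE3 announces a three-byte sequence, so the decoder expects two
# continuation bytes after it. 0x9C qualifies, but the "b" that follows
# does not, which is what triggers the error.
broken = b"\xe3\x9cber"

decode_error = None
try:
    broken.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc

# The exception carries the reason and the offending byte range.
print(decode_error)
```

The `start` and `end` attributes of the exception give the byte offsets of the bad sequence, which is handy for locating the damaged spot in a large response body.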

As the currently highest-rated answer to that question suggests, the invalid UTF-8 characters can be dropped while decoding, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
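One caveat with `'ignore'` is that the damaged character vanishes without a trace. The sketch below, again assuming Python 3 semantics and a hypothetical sample input named `broken`, compares it with the standard `'replace'` error handler, which leaves a visible U+FFFD marker where the bad bytes were:

```python
# Hypothetical input: "Über uns" with the first byte of the umlaut corrupted.
broken = b"\xe3\x9cber uns"

# 'ignore' silently drops the undecodable bytes.
ignored = broken.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (REPLACEMENT CHARACTER) for them.
replaced = broken.decode("utf-8", "replace")

print(repr(ignored))
print(ascii(replaced))
```

Which handler fits better depends on whether a missing character or a visible placeholder is less harmful for the text being scraped.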

User

#2 · edited 2024-09-30 20:34:48

Encoding the result as UTF-8 seems to work for me:

print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')

It produces:

Hier können Sie sich kostenlos registrieren und / oder einloggen!
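Part of the confusion in the question may come from printing `repr(...)`: in Python 2, the `repr` of a Unicode string shows non-ASCII characters as escapes such as `\xf6`, even when the string was decoded correctly. A small sketch, assuming Python 3, where `ascii()` plays the role of Python 2's `repr()` in this respect; `title` is a hypothetical stand-in for the attribute value:

```python
# Stand-in for the value returned by soup.find(...)['title'].
title = "Hier können Sie sich kostenlos registrieren"

# Escaped view of the text: the o-umlaut (U+00F6) appears as \xf6.
escaped = ascii(title)

# Encoded view: the same character becomes the UTF-8 bytes 0xC3 0xB6.
encoded = title.encode("utf-8")

print(escaped)
print(encoded)
```

Both outputs describe the same correctly decoded string; only the representation differs.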
