Unicode characters are replaced with question marks when scraping a web page in Python 2.7.10

response = urllib2.urlopen(url)
charset = response.headers.getheader("Content-Type")
charset = charset[charset.index("charset=") + 8:]
html = response.read()
html = " ".join(html.split())
html = html.decode(charset)
html = html.replace("amp;", "").replace("'", "'")

2 answers

User

Answer 1 · edited 2024-10-03 00:16:30

Use requests and BeautifulSoup on the first link.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

url = "http://www.nzqa.govt.nz/ncea/assessment/search.do?query=reo+maori&view=all&level=01"
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, headers=headers)
# print(r.content)
print(r.encoding)
print(r.headers['content-type'])
data = r.text
data1 = data.encode('UTF-8')
soup = BeautifulSoup(data1)
text = soup.get_text()
text2 = text.encode('utf-8', 'ignore')
# text2 = text.encode('ascii', 'ignore')
print(text2)

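Which encode line you use (UTF-8 or ASCII) depends on what you want to do with the text next.

Note that this uses the request headers Anand suggested.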
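
To make the choice between the two encode lines concrete, here is a minimal sketch of what each does to a macronised character. The sample string is an assumption chosen for illustration, not text taken from the page, and the `replace` variant is added only to show where a literal question mark can come from on the client side.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing U+0101 (a with macron).
text = u"te reo M\u0101ori"

# UTF-8 can represent every code point, so nothing is lost.
utf8_bytes = text.encode("utf-8", "ignore")

# ASCII cannot represent U+0101; "ignore" silently drops it.
ascii_bytes = text.encode("ascii", "ignore")

# With "replace", the unencodable character becomes a literal "?".
replaced = text.encode("ascii", "replace")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
print(repr(replaced))
```

If the text is going to a file or a UTF-8 terminal, the UTF-8 line keeps the macrons; the ASCII line is only appropriate when the consumer cannot handle non-ASCII bytes at all.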

User

Answer 2 · edited 2024-10-03 00:16:30

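The server appears to respond with the wrong content type when it cannot identify the user agent the request comes from. I got a similar result when I tried it on my machine.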

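After adding a valid User-Agent to the request headers, I got the response correctly encoded as UTF-8. I am not sure this is the best way to solve the problem, but it should get your code working. Example: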

req = urllib2.Request(url, headers = {"Connection":"keep-alive", "User-Agent":"Mozilla/5.0"})
response = urllib2.urlopen(req)
#After this rest of your original code.
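
One caveat about reusing the original code unchanged: `charset.index("charset=")` raises `ValueError` whenever the Content-Type header has no `charset=` parameter, which can happen when the server sends an unexpected content type as described above. The following is a rough sketch of a more defensive way to pick the charset. The helper names `charset_from_content_type` and `decode_body` and the UTF-8 default are assumptions made for illustration, not part of urllib2.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    if not content_type:
        return default
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'")
    return default


def decode_body(body, content_type, default="utf-8"):
    """Decode raw response bytes using the header charset when available."""
    charset = charset_from_content_type(content_type, default)
    try:
        # Undecodable bytes become U+FFFD instead of raising an error.
        return body.decode(charset, "replace")
    except LookupError:
        # The header named a codec Python does not know: use the default.
        return body.decode(default, "replace")
```

With the request above, this could be called as `html = decode_body(response.read(), response.headers.getheader("Content-Type"))`. Falling back to UTF-8 is a guess about the server, so check `response.headers` if the output still looks wrong.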
