UnicodeDecodeError when downloading HTML with Python

#coding: utf-8
import urllib.request
import re

url = 'http://learnvimscriptthehardway.stevelosh.com'
name = '/chapters/16.html'
while(len(name) != 0):
    url1 = url + name
    print(url1)
    response = urllib.request.urlopen(url1)
    vim = response.read().decode('utf-8')
    address = "/Users/zhangzhimin/learnvimthehardway/" + name[-2:] + ".html"
    with open(address, "w") as f:
        f.write(vim)
    print("%s finish" % name)
    x = re.findall('''<a class="next" href="(.+?)"''', vim)
    name = x[0]

:!python3 test.py
http://learnvimscriptthehardway.stevelosh.com/chapters/16.html
/chapters/16.html finish
http://learnvimscriptthehardway.stevelosh.com/chapters/17.html
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    vim = response.read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

2 answers

User

#1 · Edited 2024-09-28 21:07:06

I finally solved the problem. I should have thought of gzip from the start; this hint pointed me to it:

byte 0x8b in position 1 usually signals that the data stream is gzipped.

After I used the gzip module in the code, everything worked.
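
A minimal sketch of that fix in Python 3, assuming the server sometimes returns a gzip-compressed body even though the script did not ask for one. The helper name `decode_body` is hypothetical and does not appear in the original code. It checks for the gzip magic bytes (0x1f 0x8b, the second of which is the byte named in the traceback) before decoding as UTF-8.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Return the text of an HTTP body, gunzipping it first if needed.

    `decode_body` is a hypothetical helper, not part of urllib.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# In the question's loop this would replace the failing line:
#     vim = decode_body(response.read())
```

Checking the magic bytes rather than the Content-Encoding header is a design choice for this sketch; inspecting `response.headers.get("Content-Encoding")` would be another reasonable way to decide.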

User

#2 · Edited 2024-09-28 21:07:06

See this working example:

# Python 2 only: urllib2 does not exist in Python 3.
import urllib2
import re

name = '/chapters/16.html'
url = 'http://learnvimscriptthehardway.stevelosh.com'
while len(name) > 0:
    url1 = url + name
    response = urllib2.urlopen(url1)
    data = response.read()
    # Save under ./vim/ using the last 7 characters, e.g. "16.html".
    address = './vim/' + name[-7:]
    with open(address, 'w') as fh:
        fh.write(data)
    # Follow the "next" link; stop when the page has none.
    x = re.findall('''<a class="next" href="(.+?)"''', data)
    if x:
        name = x[0]
    else:
        break

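Note that I am using Python 2.7.10. This code downloads every chapter, as HTML, starting from the URL you specified. Replace the directory './vim/' (the current directory plus vim) with your own. I used name[-7:], which is the last 7 characters of the name, such as '16.html'. The `if` condition (if x: ...) prevents the "index out of range" error when a page has no next link.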
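
The two points in that note (deriving the file name and guarding against a missing next link) might be sketched in Python 3 as below. The helper names `next_chapter` and `local_filename` are hypothetical. `posixpath.basename` is swapped in for the fixed `name[-7:]` slice, on the assumption that chapter names could have other lengths.

```python
import posixpath
import re

# Same pattern as in the code above, compiled once.
NEXT_LINK = re.compile(r'<a class="next" href="(.+?)"')


def next_chapter(html):
    """Return the href of the "next" link, or None when there is none."""
    match = NEXT_LINK.search(html)
    return match.group(1) if match else None


def local_filename(name):
    """Return the last path component, e.g. '16.html'."""
    return posixpath.basename(name)
```

A download loop could then stop when `next_chapter` returns None instead of indexing into an empty list.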
