Answer to the question: Python seek on an HTTP response stream


1 answer

Anonymous, 1 day ago

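<p>I don't know how the C implementation works, but since internet streams are generally not seekable, my guess is that it downloads all the data to a local file or in-memory object and seeks within that. The Python equivalent would be to do as Abafei suggested: write the data to a file or a StringIO and seek from there.</p>
<p>However, if, as your comment on Abafei's answer suggests, you only want to retrieve a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. <code>urllib2</code> can be used to retrieve a specific section (or "range" in HTTP parlance) of a webpage, provided that the server supports this behaviour.</p>
<h2>The <code>Range</code> header</h2>
<p>When you send a request to a server, the parameters of the request are given in various headers. One of these is the <code>Range</code> header, defined in <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35">section 14.35 of RFC2616</a> (the specification defining HTTP/1.1). This header lets you do things such as retrieve all data starting from the 10,000th byte, or retrieve the data between bytes 1,000 and 1,500.</p>
<h2>Server support</h2>
<p>A server is not required to support range retrieval. Some servers return the <code>Accept-Ranges</code> header (<a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.5">section 14.5 of RFC2616</a>) along with a response to report whether or not they support ranges. This can be checked with a HEAD request. However, there is no particular need to do so; if a server does not support ranges, it returns the entire page, and we can then extract the desired portion of the data in Python as before.</p>
<h2>Checking whether a range was returned</h2>
<p>If a server returns a range, it must send the <code>Content-Range</code> header (<a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16">section 14.16 of RFC2616</a>) along with the response. If this header is present in the response, we know a range was returned; if it is absent, the entire page was returned.</p>
<h2>Implementation with urllib2</h2>
<p><code>urllib2</code> allows us to add headers to a request, and so to ask the server for a range rather than the entire page. The following script takes a URL, a start position and (optionally) a length on the command line, and tries to retrieve the given section of the page.</p>
<pre><code># Python 2 script (urllib2 was renamed urllib.request in Python 3).
import sys
import urllib2

# Check command line arguments.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0])
    sys.exit(1)

# Create a request for the given URL.
request = urllib2.Request(sys.argv[1])

# Add the header to specify the range to download.
if len(sys.argv) > 3:
    # Only the first two values after the URL are used: start and length.
    start, length = map(int, sys.argv[2:4])
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1))
else:
    request.add_header("range", "bytes=%s-" % sys.argv[2])

# Try to get the response. This will raise a urllib2.URLError if there is a
# problem (e.g., invalid URL).
response = urllib2.urlopen(request)

# If a content-range header is present, partial retrieval worked.
if "content-range" in response.headers:
    print "Partial retrieval successful."

    # The header contains the string 'bytes', followed by a space, then the
    # range in the format 'start-end', followed by a slash and then the total
    # size of the page (or an asterisk if the total size is unknown). Let's get
    # the range and total size from this. (The name byte_range avoids
    # shadowing the built-in range.)
    byte_range, total = response.headers['content-range'].split(' ')[-1].split('/')

    # Print a message giving the range information.
    if total == '*':
        print "Bytes %s of an unknown total were retrieved." % byte_range
    else:
        print "Bytes %s of a total of %s were retrieved." % (byte_range, total)

# No header, so partial retrieval was unsuccessful.
else:
    print "Unable to use partial retrieval."

# And for good measure, let's check how much data we downloaded.
data = response.read()
print "Retrieved data size: %d bytes" % len(data)
</code></pre>
<p>Using this, I can retrieve the final 2000 bytes of the Python homepage:</p>
<pre><code>blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 17387
Partial retrieval successful.
Bytes 17387-19386 of a total of 19387 were retrieved.
Retrieved data size: 2000 bytes
</code></pre>
<p>Or 400 bytes from the middle of the homepage:</p>
<pre><code>blair@blair-eeepc:~$ python retrieverange.py http://www.python.org/ 6000 400
Partial retrieval successful.
Bytes 6000-6399 of a total of 19387 were retrieved.
Retrieved data size: 400 bytes
</code></pre>
<p>However, the Google homepage does not support ranges:</p>
<pre><code>blair@blair-eeepc:~$ python retrieverange.py http://www.google.com/ 1000 500
Unable to use partial retrieval.
Retrieved data size: 9621 bytes
</code></pre>
<p>In this case, it would be necessary to extract the data of interest in Python before any further processing.</p>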
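<p>For the fallback case, where the server ignores the <code>Range</code> header and sends the whole page, the extraction step might look roughly like the sketch below. It is written for Python 3, where <code>urllib2</code> became <code>urllib.request</code>; the function names (<code>range_header</code>, <code>parse_content_range</code>, <code>extract_part</code>, <code>fetch_part</code>) are hypothetical helpers introduced for illustration and are not part of the original script. It assumes the server either honours the requested range or returns the full body.</p>

```python
import re
import urllib.request


def range_header(start, length=None):
    """Build a Range header value: `length` bytes from `start`, or to the end."""
    if length is None:
        return "bytes=%d-" % start
    return "bytes=%d-%d" % (start, start + length - 1)


def parse_content_range(value):
    """Parse e.g. 'bytes 6000-6399/19387' into (first, last, total).

    total is None when the server reports an unknown size ('*').
    """
    match = re.match(r"bytes (\d+)-(\d+)/(\d+|\*)$", value.strip())
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % value)
    first, last, total = match.groups()
    return int(first), int(last), (None if total == "*" else int(total))


def extract_part(body, content_range, start, length=None):
    """Return the requested bytes whether or not the range was honoured."""
    if content_range is not None:
        # Content-Range present: the body is already just the partial data.
        return body
    # No Content-Range: the whole page came back, so slice it locally.
    end = None if length is None else start + length
    return body[start:end]


def fetch_part(url, start, length=None):
    """Request a byte range from `url` (hypothetical helper; needs network)."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(start, length)}
    )
    with urllib.request.urlopen(request) as response:
        content_range = response.headers.get("Content-Range")
        return extract_part(response.read(), content_range, start, length)
```

<p>With helpers like these, a call such as <code>fetch_part(url, 1000, 500)</code> should yield the same 500 bytes either way, although against a server without range support the full page is still downloaded first.</p>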
