httplib does not return all the redirect codes

Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import httplib
>>> from urlparse import urlparse
>>> url = 'http://www.usmc.mil/units/hqmc/'
>>> host = urlparse(url)[1]
>>> req = ''.join(urlparse(url)[2:5])
>>> conn = httplib.HTTPConnection(host)
>>> conn.request('HEAD', req)
>>> resp = conn.getresponse()
>>> print resp.status
301
>>> print resp.msg.dict['location']
http://www.marines.mil/units/hqmc/
>>> url = 'http://www.marines.mil/units/hqmc/'
>>> host = urlparse(url)[1]
>>> req = ''.join(urlparse(url)[2:5])
>>> conn = httplib.HTTPConnection(host)
>>> conn.request('HEAD', req)
>>> resp = conn.getresponse()
>>> print resp.status
302
>>> print resp.msg.dict['location']
http://www.marines.mil/units/hqmc/default.aspx
>>> url = 'http://www.marines.mil/units/hqmc/default.aspx'
>>> host = urlparse(url)[1]
>>> req = ''.join(urlparse(url)[2:5])
>>> conn = httplib.HTTPConnection(host)
>>> conn.request('HEAD', req)
>>> resp = conn.getresponse()
>>> print resp.status
200
>>> print resp.msg.dict['location']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'location'
>>> print url
http://www.marines.mil/units/hqmc/default.aspx

This URL does not return a 200 in any browser I have tried.

2 Answers

User

#1 · Edited 2024-09-29 06:33:03

You can use httplib2 to get the actual location of a URL:

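import httplib2

def getContentLocation(link):
    # Cache responses on disk in the ".cache_httplib" directory.
    h = httplib2.Http(".cache_httplib")
    # Also follow redirects for methods other than GET and HEAD.
    h.follow_all_redirects = True
    # request() returns (response, content); only the response headers are needed.
    resp = h.request(link, "GET")[0]
    # After redirects, 'content-location' holds the final URL.
    contentLocation = resp['content-location']
    return contentLocation

if __name__ == '__main__':
    link = 'http://podcast.at/podcast_url344476.html'
    print(getContentLocation(link))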

Running it gives:

$ python2.7 getContentLocation.py
http://keyinvest.podcaster.de/8uhr30.rss

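Note that this example also uses caching (neither urllib nor httplib supports caching), so repeated runs are noticeably faster. That can be useful for crawling or scraping. If you do not need caching, replace h = httplib2.Http(".cache_httplib") with h = httplib2.Http().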

User

#2 · Edited 2024-09-29 06:33:03

You can try setting the User-Agent header to that of a browser.
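As a rough illustration of that suggestion, here is a minimal Python 3 sketch (in Python 3, httplib is named http.client and urlparse lives in urllib.parse). It repeats the hop-by-hop HEAD requests from the question, but sends a browser-like User-Agent on each one and records every status code along the way. The names head, trace_redirects, BROWSER_UA and REDIRECT_CODES are made up for this example, and the User-Agent string is only a placeholder; whether the server actually answers differently for a browser User-Agent is an assumption that would need to be checked.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Placeholder value (assumption): substitute the User-Agent string of a real browser.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url, user_agent=BROWSER_UA, timeout=10):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    # Rebuild the request target from path and query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Follow redirects one hop at a time; return [(url, status), ...]."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops
```

Calling trace_redirects('http://www.usmc.mil/units/hqmc/') would return one (url, status) pair per hop. The fetch parameter exists only so the loop can be exercised with canned responses instead of real network traffic.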

P.S.: urllib2 follows redirects automatically.

Edit:

In [2]: import urllib2
In [3]: resp = urllib2.urlopen('http://www.usmc.mil/units/hqmc/')
In [4]: resp.geturl()
Out[4]: 'http://www.marines.mil/units/hqmc/default.aspx'
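Automatic redirection hides the intermediate 301/302 codes. If those are wanted as well, one possible approach is to subclass the redirect handler and note each hop before delegating to the stock behaviour. The sketch below is written for Python 3, where urllib2 became urllib.request; RecordingRedirectHandler and open_and_trace are hypothetical names chosen for this example, not part of the library.

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that remembers every hop it follows."""

    def __init__(self):
        self.hops = []  # list of (from_url, status_code, to_url)

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Record the hop, then let the stock handler build the follow-up request.
        self.hops.append((req.full_url, code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def open_and_trace(url, timeout=10):
    """Open url, following redirects; return (final_url, recorded hops)."""
    handler = RecordingRedirectHandler()
    # Passing a subclass instance replaces the default redirect handler.
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.geturl(), handler.hops
```

open_and_trace('http://www.usmc.mil/units/hqmc/') would then return the final URL together with a list of (from_url, status_code, to_url) tuples, one per redirect that was followed.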
