Python: getting the redirect URL of a link from a Google Alerts feed

Posted 2024-07-01 08:38:54

If you create a Google Alert as an RSS feed (rather than having it delivered to your email address automatically), it contains links like this one: https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA

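The link is obviously a redirect (just try it and you end up here: http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/), but I cannot get that final URL with Python (short of stripping off the beginning of the URL, which is ugly).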

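So far I have tried the urllib2, httplib2 and requests packages:

  • urllib2.urlopen, then geturl() on the returned value
  • an httplib2 request with follow_all_redirects=True, then 'content-location' in the returned value
  • requests.get, then history on the returned value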

Has anyone run into this problem before? Thanks!


1 answer
User
#1 · Posted 2024-07-01 08:38:54

Google does not give you an HTTP redirect; it returns a 200 OK response rather than a 30x redirect:

>>> import requests
>>> url = 'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response = requests.get(url)
>>> response.url
u'https://www.google.com/url?rct=j&sa=t&url=http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/&ct=ga&cd=CAIyGjkyZjE1NGUzMGIwZjRkNGQ6Y29tOmVuOlVT&usg=AFQjCNHrCLmbml7baTXaqySagcuKHp-KHA'
>>> response.text
u'<script>window.googleJavaScriptRedirect=1</script><script>var m={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};m.navigateTo(window.parent,window,"http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/");\n</script><noscript><META http-equiv="refresh" content="0;URL=\'http://www.statesmanjournal.com/story/opinion/readers/2014/10/13/gmo-labels-encourage-people-make-choices/17171289/\'"></noscript>'

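The response is a piece of HTML and JavaScript that your browser interprets as an instruction to load a new URL. You need to parse that response to extract the target.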

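String splitting is enough to get it: for example, take the text between URL=' and the next single quote in the <noscript> fallback.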
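A minimal sketch of extracting the target by string splitting, assuming the body keeps the shape shown in the response above (a `<noscript>` meta refresh whose content ends in `URL='...'`). The function name `extract_redirect_target` and the shortened sample body are hypothetical, introduced only for illustration; this is not necessarily the snippet the answer originally used.

```python
def extract_redirect_target(body):
    """Return the URL quoted after "URL='" in the body, or None if absent."""
    # Keep everything after the first "URL='" marker ...
    _, marker, rest = body.partition("URL='")
    if not marker:
        return None  # marker missing: the page does not look like the sample
    # ... and cut at the closing single quote.
    target, quote, _ = rest.partition("'")
    return target if quote else None


# Shortened, hypothetical body with the same structure as the real response.
sample = (
    '<script>window.googleJavaScriptRedirect=1</script>'
    '<noscript><META http-equiv="refresh" '
    'content="0;URL=\'http://example.com/story/\'"></noscript>'
)
print(extract_redirect_target(sample))
```

With a live response you would pass response.text instead of the sample. The approach is tied to the exact markup Google returns, so it may stop working if that markup changes.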

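If we assume that the URL in the body is simply a direct reflection of the url parameter in the query string, you can also extract it from there, without even asking Google to perform the redirect: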

try:
    from urllib.parse import parse_qs, urlsplit
except ImportError:
    # Python 2
    from urlparse import parse_qs, urlsplit

target = parse_qs(urlsplit(url).query)['url'][0]
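The lookup above raises a KeyError when a feed entry has no url parameter. A slightly more defensive sketch follows; the helper name `alert_target` and the example.com links are hypothetical, and it assumes that returning the link unchanged is an acceptable fallback.

```python
try:
    from urllib.parse import parse_qs, urlsplit
except ImportError:
    # Python 2
    from urlparse import parse_qs, urlsplit


def alert_target(link):
    """Return the url query parameter of a redirect link, else the link itself."""
    params = parse_qs(urlsplit(link).query)
    # parse_qs maps each parameter name to a list of values;
    # .get() avoids a KeyError when the parameter is missing.
    values = params.get('url')
    return values[0] if values else link


wrapped = 'https://www.google.com/url?rct=j&sa=t&url=http://example.com/a/&ct=ga'
print(alert_target(wrapped))
print(alert_target('http://example.com/plain'))
```

One caveat: if the target URL itself contains an unescaped & character, parse_qs will cut it off at that point, so this only works cleanly when the target is percent-encoded or has no query string of its own.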
