Scraping Google Scholar with urllib2 instead of requests

Posted 2024-10-06 11:25:56


I have a simple script below that works well for fetching a list of articles from Google Scholar for a search term of interest.

import urllib
import urllib2
import requests
from bs4 import BeautifulSoup

SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"

def searchScholar(searchStr, limit=10):
    """Search Google Scholar for articles and publications containing terms of interest"""
    url = SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL + "?q=" + urllib.quote_plus(searchStr) + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search"
    content = requests.get(url, verify=False).text
    page = BeautifulSoup(content, 'lxml')
    results = {}
    count = 0
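    for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
        if count >= limit:
            break  # stop scanning once enough results are collected
        try:
            text = entry.a.text.encode("ascii", "ignore")
            url = entry.a['href']
            results[url] = text
            count += 1
        except (AttributeError, KeyError):
            # skip result headings that have no link or no href
            pass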
    return results

queryStr = "Albert einstein"
pubs = searchScholar(queryStr, 10)
if len(pubs) == 0:
    print "No articles found"
else:   
    for pub in pubs.keys():
        print pub + ' ' + pubs[pub]

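However, I want to run this script as a CGI application on a remote server where I have no console access, so I cannot install any external Python modules. (I managed to 'install' BeautifulSoup without pip or easy_install by copying the bs4 directory into my cgi-bin directory, but that trick did not work for requests because it has a large number of dependencies.)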

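So my question is: is it possible to use the built-in urllib2 or httplib Python modules instead of requests to fetch the Google Scholar page and then pass it to BeautifulSoup? It should be, because I found some code here that scrapes Google Scholar using only the standard library and BeautifulSoup, but it is rather convoluted. I would prefer a much simpler solution: just adapting my script to use the standard library instead of requests.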

Can anyone help me?

Tags: text import script url for search count google

1 Answer

User

#1 · Posted 2024-10-06 11:25:56

This is simple enough to do with urllib2:

def get(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)')
    return urllib2.urlopen(req).read()
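
To show where this helper slots into the question's script, here is a minimal sketch. The names `build_search_url` and `USER_AGENT`, the `timeout` parameter, and the Python 3 import fallback are my own additions for illustration and are not part of the original post; the host, path, query string and header value are copied from the code above.

```python
try:
    # Python 2, as used in the question
    import urllib2
    from urllib import quote_plus
except ImportError:
    # Assumption: on Python 3 the same calls live in urllib.request / urllib.parse
    import urllib.request as urllib2
    from urllib.parse import quote_plus

SEARCH_SCHOLAR_HOST = "https://scholar.google.com"
SEARCH_SCHOLAR_URL = "/scholar"
# Header value taken from the answer; any browser-like string could be used instead
USER_AGENT = "Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)"


def build_search_url(search_str):
    """Build the same query URL that the question's script builds."""
    return (SEARCH_SCHOLAR_HOST + SEARCH_SCHOLAR_URL
            + "?q=" + quote_plus(search_str)
            + "&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search")


def get(url, timeout=10):
    """Fetch a page with the standard library only; returns the raw body bytes."""
    req = urllib2.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return urllib2.urlopen(req, timeout=timeout).read()
```

With these in place, the line `content = requests.get(url, verify=False).text` in `searchScholar` can become `content = get(url)`, and the `import requests` line can be dropped. BeautifulSoup accepts the raw bytes directly. Note that `verify=False` has no direct counterpart here: urllib2 verifies HTTPS certificates by default only from Python 2.7.9 onward.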

If you need to do something more advanced in the future, it will take more code. What requests does is simplify the use of the standard libs.
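
As one example of the "more code" this refers to, a CGI script probably should not crash when the request fails. The sketch below is only an illustration under my own assumptions: `get_or_none` and its `opener` parameter are hypothetical names, and returning `None` on failure is a design choice rather than anything the original post prescribes.

```python
import sys

try:
    # Python 2, as used in the question
    import urllib2
    from urllib2 import HTTPError, URLError
except ImportError:
    # Assumption: Python 3 equivalents
    import urllib.request as urllib2
    from urllib.error import HTTPError, URLError


def get_or_none(url, timeout=10, opener=None):
    """Return the page body as bytes, or None if the request fails.

    `opener` is a hypothetical hook (defaults to urllib2.urlopen) so the
    function can be exercised without network access.
    """
    opener = opener or urllib2.urlopen
    req = urllib2.Request(url)
    req.add_header("User-Agent", "Mozilla/2.0 (compatible; MSIE 5.5; Windows NT)")
    try:
        return opener(req, timeout=timeout).read()
    except HTTPError as err:
        # The server answered, but with an error status code.
        # HTTPError is a subclass of URLError, so it must be caught first.
        sys.stderr.write("HTTP error %d for %s\n" % (err.code, url))
        return None
    except URLError as err:
        # The server could not be reached (DNS failure, refused connection, ...)
        sys.stderr.write("Request failed for %s: %s\n" % (url, err.reason))
        return None
```

Messages go to stderr because, in a CGI script, anything written to stdout becomes part of the HTTP response.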
