Automatically downloading articles from ScienceDirect

Posted 2024-06-26 01:36:37


I am trying to automatically download articles from ScienceDirect, for example:

url = 'http://www.sciencedirect.com/science/article/pii/S1053811913010240'

I can access these articles in my browser without any problem, but I have tried Python's requests and urllib2 modules, among others, without success. Since I need to download many articles, downloading them manually is not an option.
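For instance, a bare GET along these lines (a sketch of the kind of request I tried; the exact status code varies) comes back with an error instead of the article:

import requests

url = 'http://www.sciencedirect.com/science/article/pii/S1053811913010240'
response = requests.get(url)   # sent with the default python-requests User-Agent
print(response.status_code)    # an error status rather than the article page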

Wget does not work either.

For example:

wget "http://www.sciencedirect.com/science/article/pii/S1053811913010240"

returns:

HTTP request sent, awaiting response... 404 Not Found

What is going wrong?


Tags: module, com, http, url, www, article, browser
2 Answers

They are probably failing because the web server does not like your user agent. It may be trying to block bulk downloading.

If you specify a user agent with wget, it works. Using your example:

wget -U "Mozilla/5.0" "https://www.sciencedirect.com/science/article/pii/S1053811913010240"
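The same fix works from Python; here is a minimal sketch with requests (assuming it is installed), sending a browser-like User-Agent header:

import requests

url = 'https://www.sciencedirect.com/science/article/pii/S1053811913010240'
# Pretend to be a regular browser so the server does not reject the request.
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
print(response.status_code)  # should be 200 once the user agent is accepted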

Here is some code I adapted from pyscholar:

#!/usr/bin/python
#author: Bryan Bishop <kanzure@gmail.com>
#date: 2010-03-03
#purpose: given a link on the command line to sciencedirect.com, download the associated PDF and put it in "sciencedirect.pdf" or something
import os
import re
import pycurl
#from BeautifulSoup import BeautifulSoup
from lxml import etree
import lxml.html
from StringIO import StringIO

user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.5) Gecko/20091123 Iceweasel/3.5.5 (like Firefox/3.5.5; Debian-3.5.5-1)"

def interscience(url):
    '''downloads the PDF from sciencedirect given a link to an article'''
    url = str(url)
    buffer = StringIO()

    curl = pycurl.Curl()
    curl.setopt(curl.URL, url)
    curl.setopt(curl.WRITEFUNCTION, buffer.write)
    curl.setopt(curl.VERBOSE, 0)
    curl.setopt(curl.USERAGENT, user_agent)
    curl.setopt(curl.TIMEOUT, 20)
    curl.perform()
    curl.close()

    buffer = buffer.getvalue().strip()
    html = lxml.html.parse(StringIO(buffer))

    # collect the href of every anchor whose id is "pdfLink" (the PDF download link)
    pdf_href = []
    for item in html.getroot().iter('a'):
        if ('id' in item.attrib) and ('href' in item.attrib) and item.attrib['id'] == 'pdfLink':
            pdf_href.append(item.attrib['href'])


    pdf_href = pdf_href[0]
    #now let's get the article title

    title_div = html.find("head/title")
    paper_title = title_div.text
    paper_title = paper_title.replace("\n", "")
    if paper_title[-1] == " ": paper_title = paper_title[:-1]
    paper_title = re.sub(r'[^a-zA-Z0-9_\-.() ]+', '', paper_title)  # drop characters unsafe in filenames
    paper_title = paper_title.strip()
    paper_title = re.sub(' ','_',paper_title)

    #now fetch the document for the user
    command = "wget  user-agent=\"pyscholar/blah\"  output-document=\"%s.pdf\" \"%s\"" % (paper_title, pdf_href)
    os.system(command)
    print "\n\n"

interscience("http://www.sciencedirect.com/science/article/pii/S0163638307000628")
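For reference, the same approach carries over to Python 3, with requests standing in for pycurl and wget; a rough sketch (the pdfLink anchor id is taken from the script above and may no longer match ScienceDirect's current markup):

#!/usr/bin/python3
import re
import requests
import lxml.html
from urllib.parse import urljoin

USER_AGENT = "Mozilla/5.0"

def download_pdf(url):
    '''downloads the PDF from sciencedirect given a link to an article'''
    headers = {"User-Agent": USER_AGENT}
    page = requests.get(url, headers=headers, timeout=20)
    page.raise_for_status()
    root = lxml.html.fromstring(page.content)

    # find the same anchor the original script targets
    hrefs = root.xpath('//a[@id="pdfLink"]/@href')
    if not hrefs:
        raise RuntimeError("no pdfLink anchor found on %s" % url)
    pdf_href = urljoin(url, hrefs[0])  # resolve in case the link is relative

    # build a filesystem-friendly filename from the page title
    title = root.findtext("head/title") or "sciencedirect"
    title = re.sub(r'[^a-zA-Z0-9_\-.() ]+', '', title).strip().replace(' ', '_')

    pdf = requests.get(pdf_href, headers=headers, timeout=20)
    pdf.raise_for_status()
    with open("%s.pdf" % title, "wb") as f:
        f.write(pdf.content)

download_pdf("http://www.sciencedirect.com/science/article/pii/S0163638307000628")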
