Scrapy download with the correct file extension

I have a spider:

import os

import pandas as pd
import scrapy

# FOLDER, LIST and LinkCheckerItem are defined elsewhere in my project


class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        # read the column of links from the excel file
        df = pd.read_excel(LIST)
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for index, url in urls.items():
            yield scrapy.Request(url=url, callback=self.download_file, errback=self.errback_httpbin, meta={"index": index}, dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        content_type = response.headers["Content-Type"]  # always application/octet-stream, see below

        # the file is saved under its row index, with no extension
        download_path = os.path.join(self.download_folder, str(index))

        with open(download_path, "wb") as f:
            f.write(response.body)

        yield LinkCheckerItem(index=index, url=url, code="downloaded")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"], url=failure.request.url, code="error")

It should:

  1. read the excel with the links (LIST)
  2. go to each link and download the file into FOLDER
  3. record the result in a LinkCheckerItem (which I export to csv; a sketch of that item follows this list)
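
For context, LinkCheckerItem is just a plain item with the three fields used above, and the csv comes from a feed export. Roughly like this (a sketch only; the module layout and the output filename "results.csv" are placeholders, not my real project):

# items.py (sketch): the three fields match what the spider yields
import scrapy

class LinkCheckerItem(scrapy.Item):
    index = scrapy.Field()   # row index from the excel file
    url = scrapy.Field()     # link that was requested
    code = scrapy.Field()    # "downloaded" or "error"

# settings.py (sketch): feed export that writes the items to csv
FEEDS = {
    "results.csv": {"format": "csv"},
}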

This generally works fine, but my list contains different kinds of files: archives, pdf, doc and so on.

Here is a sample of the links in my LIST:

https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563
http://npoimpuls.ru/templates/npoimpuls/material/documents/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA%20%D0%B0%D1%84%D1%84%D0%B8%D0%BB%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%20%D0%BB%D0%B8%D1%86%20%D0%BD%D0%B0%2030.06.2016.pdf
http://нпоимпульс.рф/assets/download/sal30.09.2017.pdf
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166287

I would like it to save each file with its original extension, whatever that is, just like my browser does when it pops up the save-file dialog.

I tried to use response.headers["Content-Type"] to work out the type, but in this case it is always application/octet-stream.
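
To make the goal concrete, this is roughly the behaviour I am after in download_file: take the name the server suggests in the Content-Disposition header if there is one, and otherwise fall back to the extension in the URL path. This is only a sketch under those assumptions (I have not checked how each of these servers formats the header), with a hypothetical helper guess_filename:

import os
import re
from urllib.parse import unquote, urlparse

def guess_filename(response, default_name):
    # try the name the server suggests, e.g. Content-Disposition: attachment; filename="report.pdf"
    disposition = response.headers.get("Content-Disposition", b"").decode("utf-8", "ignore")
    match = re.search(r'filename="?([^";]+)"?', disposition)
    if match:
        return unquote(match.group(1))
    # otherwise fall back to the last part of the URL, e.g. .../sal30.09.2017.pdf
    ext = os.path.splitext(unquote(urlparse(response.url).path))[1]
    return default_name + ext   # may still end up without an extension

With something like that, download_path would become os.path.join(self.download_folder, guess_filename(response, str(index))), but I do not know whether these servers reliably send the header, hence the question.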

How can I do that?


Tags: https, self, http, url, index, response, download, def