Importing an Excel .csv as start URLs

Published 2024-10-03 23:28:07


I'm building a scraper that imports an Excel .csv file containing a single row of roughly 2,400 websites (each website in its own column) and uses these as the start URLs. I get an error saying a list was passed where a string was expected. I think this is because my list is essentially one very long list representing that row. How can I get around this and have each website from my .csv end up as its own separate string in the list?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:
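
To see why Scrapy complains about a list: csv.reader yields each row as a list of strings, so appending rows produces a list of lists. A minimal sketch with made-up URLs:

import csv
from io import StringIO

# a made-up one-row CSV: every URL sits in its own column
sample = StringIO("http://a.example,http://b.example,http://c.example\n")

scrapurls = []
for row in csv.reader(sample):
    scrapurls.append(row)  # appends the whole row (a list), not its cells

print(scrapurls)
# [['http://a.example', 'http://b.example', 'http://c.example']]
# Scrapy iterates start_urls and receives a list, hence the TypeError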


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item

Thanks!

Joey


Tags: or, csv, string, from, import, url, list, website
3 Answers
  for row in data:
    scrapurls.append(row)

row is a list [column1, column2, ...], so I think you need to extract the columns and append them to your start URLs, for example:

  for row in data:
    for column in row:
      scrapurls.append(column)

Just generating a list for start_urls will not work, as is clearly written in the Scrapy documentation.

From the documentation:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
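
In other words, the default start_requests() is roughly equivalent to the following sketch (the exact code varies across Scrapy versions):

def start_requests(self):
    # one Request per entry in start_urls; parse() becomes the callback
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)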

I would rather do it like this:

def get_urls_from_csv():
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # add each cell of the row, not the row itself
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
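
Since start_requests() may also be written as a generator, a lazier variant of the same idea (using the same get_urls_from_csv() helper as above) avoids building the whole Request list in memory:

    def start_requests(self):
        for start_url in get_urls_from_csv():
            yield scrapy.Request(url=start_url)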

Try opening the .csv file inside the class (rather than outside it, as before) and appending the start URLs. This solution worked for me. Hope this helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            # split off the trailing newline and keep only the URL
            u = i.split('\n')
            start_urls.append(u[0])
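
For what it's worth, a slightly tidier sketch of the same idea (same hypothetical websites.csv), using strip() inside a with block so the file is closed properly and blank lines are skipped:

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []

        # runs at class-definition time; one URL per line of the file
        with open('websites.csv') as f:
            start_urls = [line.strip() for line in f if line.strip()]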
