Importing an Excel .csv as start URLs

Published 2024-10-03 23:28:07


I'm building a scraper that imports an Excel .csv file containing a single row of roughly 2,400 websites (each website in its own column) and uses these as the start URLs. I get an error saying a list was passed where a string was expected. I think this is because my list is essentially one very long list representing that row. How can I get around this and have each website from my .csv end up as its own separate string in the list?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:
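
To see why Scrapy complains about a list: csv.reader yields each row as a list of strings, so appending rows produces a list of lists. A minimal sketch with made-up URLs:

import csv
from io import StringIO

# a made-up one-row CSV: every URL sits in its own column
sample = StringIO("http://a.example,http://b.example,http://c.example\n")

scrapurls = []
for row in csv.reader(sample):
    scrapurls.append(row)  # appends the whole row (a list), not its cells

print(scrapurls)
# [['http://a.example', 'http://b.example', 'http://c.example']]
# Scrapy iterates start_urls and receives a list, hence the TypeError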


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item

Thanks!

Joey


Tags: or, csv, string, from, import, url, list, website
3 Answers
  for row in data:
    scrapurls.append(row)

row is a list [column1, column2, ...], so I think you need to extract the columns and append them to your start URLs, for example:

  for row in data:
    for column in row:
      scrapurls.append(column)

Just generating a list for start_urls will not work, as is clearly written in the Scrapy documentation.

From the documentation:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
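
In other words, the default start_requests() is roughly equivalent to the following sketch (the exact code varies across Scrapy versions):

def start_requests(self):
    # one Request per entry in start_urls; parse() becomes the callback
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)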

I would rather do it like this:

def get_urls_from_csv():
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # add each cell of the row, not the row itself
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
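
Since start_requests() may also be written as a generator, a lazier variant of the same idea (using the same get_urls_from_csv() helper as above) avoids building the whole Request list in memory:

    def start_requests(self):
        for start_url in get_urls_from_csv():
            yield scrapy.Request(url=start_url)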

Try opening the .csv file inside the class (rather than outside it, as before) and appending the start URLs. This solution worked for me. Hope this helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            # split off the trailing newline and keep only the URL
            u = i.split('\n')
            start_urls.append(u[0])
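
For what it's worth, a slightly tidier sketch of the same idea (same hypothetical websites.csv), using strip() inside a with block so the file is closed properly and blank lines are skipped:

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []

        # runs at class-definition time; one URL per line of the file
        with open('websites.csv') as f:
            start_urls = [line.strip() for line in f if line.strip()]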
