<p>Simply generating a list for <code>start_urls</code> will not work, as the <a href="http://doc.scrapy.org/en/latest/topics/spiders.html" rel="nofollow">Scrapy documentation</a> makes clear.</p>
<p>According to the documentation:</p>
<blockquote>
<p>You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.</p>
<p>The first requests to perform are obtained by calling the
<code>start_requests()</code> method which (by default) generates <code>Request</code> for
the URLs specified in the <code>start_urls</code> and the <code>parse</code> method as
callback function for the Requests.</p>
</blockquote>
<p>I would do it this way instead:</p>
<pre><code>import csv

import scrapy


def get_urls_from_csv():
    # Open in text mode with newline='' (the csv-module convention in Python 3);
    # mode 'rbU' would break because csv.reader expects text, not bytes.
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        # Each row is a list of columns; the URL sits in the first column,
        # so append row[0] rather than the whole row.
        return [row[0] for row in data]


class DanishSpider(scrapy.Spider):
    ...
    def start_requests(self):
        return [scrapy.Request(url=start_url) for start_url in get_urls_from_csv()]
</code></pre>
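<p>As a quick, self-contained sanity check of the CSV helper on its own (no Scrapy needed; the sample file and URLs below are made up for illustration, assuming <code>websites.csv</code> holds one URL per row):</p>

```python
import csv

# Write a small sample websites.csv (hypothetical contents for illustration).
with open('websites.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['http://example.com'], ['http://example.org']])

def get_urls_from_csv():
    # newline='' is the csv-module convention for text files in Python 3.
    with open('websites.csv', newline='') as csv_file:
        # Each row is a list of columns; the URL sits in the first column.
        return [row[0] for row in csv.reader(csv_file)]

print(get_urls_from_csv())  # → ['http://example.com', 'http://example.org']
```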