<p>A simpler approach is to subclass <code>scrapy.spiders.CrawlSpider</code> and specify the <code>rules</code> attribute:</p>
<pre><code>from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ConventionSpider(CrawlSpider):
    name = 'convention'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['events.jspargo.com']
    start_urls = ['https://events.jspargo.com/ASCB18/Public/Exhibitors.aspx?sortMenu=102003']

    rules = (
        # An empty allow tuple extracts every link on the page; note that a
        # deny of ('') would match (and so exclude) everything, so no deny
        # pattern is passed here.
        Rule(LinkExtractor(allow=()),
             callback='parse_item',  # called for each extracted link's response
             follow=True),
    )

    def parse_item(self, response):
        names = response.xpath('//*[@class="companyName"]')
        numbers = response.xpath('//*[@class="boothLabel"]')
        for row, row1 in zip(names, numbers):
            company = row.xpath('.//*[@class="exhibitorName"]/text()').extract_first()
            booth_num = row1.xpath('.//*[@class="boothLabel aa-mapIt"]/text()').extract_first()
            # No need to extract and follow links manually;
            # the CrawlSpider rules take care of that.
            yield {'Company': company, 'Booth Number': booth_num}
</code></pre>
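<p>Assuming the spider is saved to a standalone file (the filename <code>convention_spider.py</code> below is just an example), it can be run without a full Scrapy project via <code>scrapy runspider</code>, with <code>-o</code> writing the yielded items to a feed file:</p>
<pre><code>scrapy runspider convention_spider.py -o exhibitors.json
</code></pre>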
<p>However, make sure not to name your callback <code>parse</code>, because <code>scrapy.spiders.CrawlSpider</code> uses the <code>parse</code> method internally to implement its crawling logic; overriding it breaks the rule processing.</p>
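<p>For illustration, a minimal sketch of the mistake to avoid (the class name here is hypothetical):</p>
<pre><code>class BrokenSpider(CrawlSpider):
    name = 'broken'
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    # Defining parse() shadows CrawlSpider.parse, which is what dispatches
    # responses to the rules above, so parse_item is never called.
    def parse(self, response):
        pass
</code></pre>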