Different number of URLs returned

Posted 2024-10-04 05:31:00


I have built a crawler that crawls within a fixed domain and extracts URLs matching a fixed regular expression. If it sees a particular URL, the crawler follows that link. The crawler extracts the URLs perfectly well, but every time I run it, it returns a different number of links, i.e. the link count varies from run to run. I am crawling with Scrapy. Could this be related to Scrapy? The code is:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        # Append each matched URL to a text file; note that 'a' mode
        # keeps the URLs from previous runs in the file as well.
        print(response.url)
        with open('urllist.txt', 'a') as outputfile:
            outputfile.write(response.url + '\n')
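
For reference, here is a quick check of what each rule's allow pattern matches; the sample URLs below are invented purely for illustration:

import re

# Hypothetical URLs, made up only to illustrate the two rules above.
detail = re.compile(r'\/V-\d{7}\/[\w\S]+')
paging = re.compile(r'\?page\=\d+\&sortCriteria\=1')

print(bool(detail.search('http://www.xyz.nl/V-1234567/some-vacancy')))          # True
print(bool(paging.search('http://www.xyz.nl/Vacancies?page=2&sortCriteria=1'))) # True
print(bool(detail.search('http://www.xyz.nl/Vacancies')))                       # False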

Tags: program, url, count, parse, links, response, nl, item
1 Answer
Community user
#1 · Posted 2024-10-04 05:31:00

Instead of writing the links out by hand in the parse_item() method, with the file opened in append ('a') mode, use Scrapy's built-in item exporters. (As a side note, append mode alone can make the totals look inconsistent: urllist.txt keeps accumulating URLs from every previous run.) Define an item with a url field:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    url = Field()


class MySpider(CrawlSpider):
    name = "xyz"
    allowed_domains = ["xyz.nl"]
    start_urls = ["http://www.xyz.nl/Vacancies"]
    rules = (Rule(SgmlLinkExtractor(allow=[r'\/V-\d{7}\/[\w\S]+']), callback='parse_item'),
             Rule(SgmlLinkExtractor(allow=[r'\?page\=\d+\&sortCriteria\=1']), follow=True),)

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        yield item
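
With the item defined, there is no file handling left in the spider: Scrapy's feed exports serialize every yielded item for you, e.g. from the command line with scrapy crawl xyz -o urllist.csv. As a minimal sketch, the same can also be wired up from a script; the FEED_URI/FEED_FORMAT settings below are the feed-export options of older Scrapy releases (newer versions use the FEEDS dict instead):

from scrapy.crawler import CrawlerProcess

# Minimal sketch: run MySpider with the built-in CSV item exporter.
process = CrawlerProcess({
    'FEED_URI': 'urllist.csv',   # output file
    'FEED_FORMAT': 'csv',        # which built-in exporter to use
})
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes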
