Trying to write output with Scrapy while crawling

Posted 2024-06-25 22:51:07


I am trying to crawl a site and collect all of its page links using Scrapy.

When I run it from the terminal as scrapy crawl crawl1 -o items.csv -t csv, I can see that it does crawl and fetch links such as the ones below, but it writes nothing to the output file mentioned above.

2016-12-05 16:17:33 [scrapy] DEBUG: Crawled (200) <GET http://www.abof.com/men/new-in/footwear> (referer: http://www.abof.com/)
2016-12-05 16:17:33 [scrapy] DEBUG: Crawled (200) <GET http://www.abof.com/> (referer: http://www.abof.com/)
2016-12-05 16:17:33 [scrapy] DEBUG: Crawled (200) <GET http://www.abof.com/skult> (referer: http://www.abof.com/)

I have also tried the following:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from crawl.items import CrawlItem
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst


class CrawlLoader(XPathItemLoader):
    default_output_processor = TakeFirst()


class MySpider(CrawlSpider):
    name = "crawl1"
    allowed_domains = ["www.abof.com"]
    start_urls = ["http://www.abof.com/"]
    #follow= True
    rules = (Rule(SgmlLinkExtractor(allow=()), callback="parse_items", ),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath('//span[@class="pl"]')
        items = []
        l = CrawlLoader(CrawlItem(), hxs)
        # Note: the original loop reused the name "titles" for the loop
        # variable, shadowing the selector list after the first iteration.
        for title in titles:
            item = CrawlItem()
            # l.add_value("url", response.url)
            # l.add_xpath("title", title.xpath("a/text()").extract())
            # l.add_xpath("link", title.xpath("a/@href").extract())

            item["title"] = title.xpath("a/text()").extract()
            item["url"] = title.xpath("a/@href").extract()
            items.append(item)
        return items
        # return l.load_item()
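As a side check, the kind of extraction parse_items attempts (anchor text plus href for each link) can be reproduced with the standard library alone. This is only an illustration of the expected item shape, not Scrapy's API:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {"title", "url"} dicts for every <a> tag, mirroring
    titles.xpath('a/text()') / titles.xpath('a/@href') in the spider."""

    def __init__(self):
        super().__init__()
        self.links = []      # one dict per link, like the spider's items list
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({"title": "".join(self._text).strip(),
                               "url": self._href})
            self._href = None


html = '<span class="pl"><a href="/men/new-in/footwear">Footwear</a></span>'
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

If this prints an empty list for your real page markup, the XPath in the spider is likely the problem rather than the feed export.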

items.py:

import scrapy

class CrawlItem(scrapy.Item):
    # define the fields for your item here like:                                                                                                                                                            
    # name = scrapy.Field()                                                                                                                                                                                 
    title = scrapy.Field()
    url = scrapy.Field()
    pass
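What the CSV feed exporter ultimately produces is just rows keyed by the item fields. The same shape can be sketched with the standard library's csv module (illustration only, with made-up sample data, not Scrapy's exporter):

```python
import csv
import io

# Sample items shaped like CrawlItem: one dict per crawled page.
items = [
    {"title": "abof home", "url": "http://www.abof.com/"},
    {"title": "New in footwear", "url": "http://www.abof.com/men/new-in/footwear"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "url"])
writer.writeheader()       # header row: title,url
writer.writerows(items)    # one row per item
print(buf.getvalue())
```

If the spider's callbacks return no items, the exporter writes nothing at all, which matches the empty items.csv described above.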

Any help is appreciated.


1 Answer

It worked after changing the parse_items function; the original version was trying to parse images and other data that are rendered by JavaScript.

# Imports match the question's code; note that on Scrapy >= 1.0 the modern
# locations are scrapy.spiders (CrawlSpider, Rule) and
# scrapy.linkextractors (LinkExtractor).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import TakeFirst
from crawl.items import CrawlItem


class CrawlLoader(XPathItemLoader):
    default_output_processor = TakeFirst()


class MySpider(CrawlSpider):
    name = "crawl1"
    allowed_domains = ["www.abof.com"]
    start_urls = ["http://www.abof.com/"]
    # follow=True keeps the spider following links found on each crawled page.
    rules = (Rule(SgmlLinkExtractor(allow=()), callback="parse_items", follow=True),)

    def parse_items(self, response):
        # One item per crawled page: its URL and <title> text.
        href = CrawlItem()
        href["url"] = response.url
        href["title"] = response.xpath("//title/text()").extract()
        return href
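For reference, on current Scrapy releases the -t csv flag is unnecessary: scrapy crawl crawl1 -O items.csv infers the format from the file extension, and feeds can also be configured in settings.py. A minimal sketch, assuming Scrapy >= 2.1 where the FEEDS setting was introduced:

```python
# settings.py — feed export configuration (sketch; FEEDS needs Scrapy >= 2.1)
FEEDS = {
    "items.csv": {"format": "csv", "overwrite": True},
}
```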
