My code is below. I want to export the results to CSV, but Scrapy's result is a dict with 2 keys, with all the values lumped together under each key, so the output looks wrong. How can I fix this? Can it be done with a pipeline, an item loader, or the like?
Many thanks.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join
from gumtree1.items import GumtreeItems


class AdItemLoader(ItemLoader):
    jobs_in = MapCompose(unicode.strip)


class GumtreeEasySpider(CrawlSpider):
    name = 'gumtree_easy'
    allowed_domains = ['gumtree.com.au']
    start_urls = ['http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="rs-paginator-btn next"]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        loader = AdItemLoader(item=GumtreeItems(), response=response)
        loader.add_xpath('jobs', '//div[@id="recent-sr-title"]/following-sibling::*//*[@itemprop="name"]/text()')
        loader.add_xpath('location', '//div[@id="recent-sr-title"]/following-sibling::*//*[@class="rs-ad-location-area"]/text()')
        yield loader.load_item()
The result is:

^{pr2}$

Should it be like this, with jobs and location as individual items? The spider below does write jobs and locations to CSV correctly, but I feel that a for loop with zip is not the best way to do it.
import scrapy
from gumtree1.items import GumtreeItems


class AussieGum1Spider(scrapy.Spider):
    name = "aussie_gum1"
    allowed_domains = ["gumtree.com.au"]
    start_urls = (
        'http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering',
    )

    def parse(self, response):
        jobs = response.xpath('//div[@id="recent-sr-title"]/following-sibling::*//*[@itemprop="name"]/text()').extract()
        location = response.xpath('//div[@id="recent-sr-title"]/following-sibling::*//*[@class="rs-ad-location-area"]/text()').extract()
        for j, l in zip(jobs, location):
            # create a fresh item per row rather than reusing one instance
            item = GumtreeItems()
            item['jobs'] = j.strip()
            item['location'] = l
            yield item
Part of the results is shown below.
2016-03-16 02:20:46 [scrapy] DEBUG: Crawled (200) <GET http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering> (referer: http://www.gumtree.com.au/s-jobs/page-2/c9302?ad=offering)
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Live In Au pair-Urgent', 'location': u'Wanneroo Area'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'live in carer', 'location': u'Fraser Coast'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Mental Health Nurse', 'location': u'Perth Region'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Experienced NBN pit and pipe installers/node and cabinet wor...',
'location': u'Marrickville Area'}
2016-03-16 02:20:46 [scrapy] DEBUG: Scraped from <200 http://www.gumtree.com.au/s-jobs/page-3/c9302?ad=offering>
{'jobs': u'Delivery Driver / Pizza Maker Job - Dominos Pizza',
'location': u'Hurstville Area'}
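One caveat with pairing the two extracted lists via zip, as the spider above does: zip silently stops at the shorter list, so an ad with a missing location would make later pairs shift or drop without any error. itertools.zip_longest (izip_longest on Python 2) at least makes the mismatch visible. A minimal standalone sketch with made-up data:

```python
from itertools import zip_longest  # izip_longest on Python 2

jobs = ['Au pair', 'Carer', 'Nurse']
locations = ['Wanneroo Area', 'Fraser Coast']  # one location missing

# zip() silently drops the unmatched job:
pairs = list(zip(jobs, locations))

# zip_longest() keeps it, padding with a sentinel so the gap is visible:
padded = list(zip_longest(jobs, locations, fillvalue='UNKNOWN'))
```
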
Many thanks.
Have a parent selector for each item and extract the job and location relative to it.
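The per-parent idea, sketched with stdlib ElementTree on simplified, made-up markup (the class names and structure here are invented; the real spider would use response.xpath with the page's actual selectors): select each ad's container first, then run the job and location paths relative to that container, so every item is built from exactly one ad.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for the listing page's HTML.
html = """
<div>
  <div class="ad">
    <span class="name">Live In Au pair-Urgent</span>
    <span class="area">Wanneroo Area</span>
  </div>
  <div class="ad">
    <span class="name">Mental Health Nurse</span>
    <span class="area">Perth Region</span>
  </div>
</div>
"""

root = ET.fromstring(html)
items = []
# One parent selector per ad; fields are extracted relative to it.
for ad in root.findall('.//div[@class="ad"]'):
    items.append({
        'jobs': ad.findtext('./span[@class="name"]').strip(),
        'location': ad.findtext('./span[@class="area"]').strip(),
    })
```

In the Scrapy spider itself the same pattern is a loop over `response.xpath(parent_xpath)`, calling `ad.xpath('.//relative_xpath')` on each selector in the loop body.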
Honestly, using the for loop is the right way to go, but you could alternatively solve it in a pipeline, while also adding a custom item:

^{pr2}$

Hope this helps; again, I think you should use the loop.
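For the pipeline route alluded to above (the original pipeline code was not preserved), one possibility is a hypothetical pipeline that lets the spider yield both raw lists in a single item and then zips them into one CSV row per pair. A pipeline is just a plain class, so the sketch below runs without Scrapy:

```python
import csv


class PairedCsvPipeline(object):
    """Hypothetical pipeline: pairs the 'jobs' and 'location' lists
    scraped into one item and writes one CSV row per pair."""

    def open_spider(self, spider):
        self.file = open('ads.csv', 'w', newline='')
        self.writer = csv.writer(self.file)
        self.writer.writerow(['jobs', 'location'])

    def process_item(self, item, spider):
        for job, loc in zip(item['jobs'], item['location']):
            self.writer.writerow([job.strip(), loc.strip()])
        return item

    def close_spider(self, spider):
        self.file.close()


# Standalone demonstration with made-up data (no Scrapy needed):
pipeline = PairedCsvPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'jobs': ['Carer ', 'Nurse'],
                       'location': [' Fraser Coast', 'Perth Region']},
                      spider=None)
pipeline.close_spider(spider=None)
```

In a real project the pipeline would be enabled via the ITEM_PIPELINES setting in settings.py.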