Scrapy does not scrape bold text

Posted 2024-09-29 06:22:05


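I have been working on this spider for months and keep running into the same problem. Can anyone help?

On the website mentioned below, all of the instrument data is scraped except the "model name", which is in bold. This is infuriating, and I don't know what to do.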

import re
import json
from urlparse import urlparse


from scrapy.selector import Selector
try:
    from scrapy.spider import Spider
except ImportError:
    from scrapy.spider import BaseSpider as Spider
from scrapy.utils.response import get_base_url
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from database.items import databaseItem

from scrapy.log import *

class CommonSpider(CrawlSpider):
    name = 'brands.py'
    allowed_domains = ['usedprice.com']
    start_urls = ['http://www.usedprice.com/items/guitars-musical-instruments/index.html']

    rules = (

        Rule(LinkExtractor(allow=( )), callback='parse_item'),
    )


    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = databaseItem()
        datao = hxs.xpath('//tr[@class="oddItemColor baseText"]')
        datae = hxs.xpath('//tr[@class="evenItemColor baseText"]')
        tmpNextPage = hxs.xpath('//div[@class="baseText blue"]/span[@id="pnLink"]/a/@href').extract()
        for attr in datao:
            modelInfo = attr.xpath('.//b/text()').extract()  # the bold text that is not scraped
            instrInfo = attr.xpath('.//td//text()').extract()
            item['modelInfo'].append = modelInfo
            item['instrInfo'].append = instrInfo
            return databaseItem(modelInfo = modelInfo[1:], instrInfo = instrInfo[2:])
        for attr in datae:
            modelInfo = attr.xpath('.//b/text()').extract()  # the bold text that is not scraped
            instrInfo = attr.xpath('.//td//text()').extract()
            item['modelInfo'].append = modelInfo
            item['instrInfo'].append = instrInfo
            return databaseItem(modelInfo = modelInfo[1:], instrInfo = instrInfo[2:])

1 answer
User
#1 · Posted 2024-09-29 06:22:05

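You should:

  • Use exact array indexing (not slicing) to extract the modelInfo field
    • a one-element list sliced with [1:] is empty: [0][1:] == []
  • yield each databaseItem instead of returning it
  • Add some logic to detect whether the bold text is the text you want
    • the second row of the table of interest shows "Description" in bold, and the "Type" information has no "column"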
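
The three suggestions above can be sketched in plain Python. This is a minimal sketch, not the spider itself: the helper names first_or_none and build_items, the HEADER_LABELS set, and the sample rows are all hypothetical, and plain lists stand in for the results of attr.xpath('.//b/text()').extract() and attr.xpath('.//td//text()').extract().

```python
# Assumption: bold labels that mark a header row rather than a model name.
HEADER_LABELS = {"Description"}


def first_or_none(values):
    """Return the first extracted string, stripped, or None for an empty list.

    Indexing with [0] keeps the only element; slicing with [1:] would drop it.
    """
    return values[0].strip() if values else None


def build_items(rows):
    """Yield one dict per data row instead of returning after the first row.

    `rows` is an iterable of (bold_texts, cell_texts) pairs, i.e. the two
    lists that the .extract() calls in the question would produce per <tr>.
    """
    for bold_texts, cell_texts in rows:
        model = first_or_none(bold_texts)
        # Skip rows with no bold text and rows whose bold text is a header.
        if model is None or model in HEADER_LABELS:
            continue
        cells = [c.strip() for c in cell_texts if c.strip()]
        yield {"modelInfo": model, "instrInfo": cells}


# Made-up sample data, for illustration only.
sample_rows = [
    (["Description"], ["Description", "Type"]),
    (["Jazz Bass"], ["Jazz Bass", " Electric ", "\n"]),
    ([], ["no bold text here"]),
]

for item in build_items(sample_rows):
    print(item)
```

In the spider, the same loop would sit inside parse_item and yield databaseItem(modelInfo=..., instrInfo=...) for every row of datao and datae, so that all rows on the page produce items rather than only the first.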

(I did not run your code, but I did look at http://www.usedprice.com/items/guitars-musical-instruments/a-basses/index.html)
