Scrapy LinkExtractor does not extract the correct URL

2015-12-15 20:38:43 [scrapy] INFO: Spider opened
2015-12-15 20:38:43 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-15 20:38:43 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 20:38:44 [scrapy] DEBUG: Crawled (200) <GET http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93> (referer: None)
2015-12-15 20:38:50 [scrapy] DEBUG: Crawled (404) <GET http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++> (referer: http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93)
2015-12-15 20:38:50 [scrapy] DEBUG: Ignoring response <404 http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20http://task.zhubajie.com/success/p2.htmlkw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93++++++++++++++++>: HTTP status code is not handled or not allowed
...
2015-12-15 20:39:18 [scrapy] INFO: Closing spider (finished)
2015-12-15 20:39:18 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2578,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 57627,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 5,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 15, 12, 39, 18, 70000),
 'log_count/DEBUG': 12,
 'log_count/INFO': 7,
 'log_count/WARNING': 2,
 'request_depth_max': 1,
 'response_received_count': 6,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2015, 12, 15, 12, 38, 43, 693000)}
2015-12-15 20:39:18 [scrapy] INFO: Spider closed (finished)

start_urls = [
    'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93',
]
rules = [
    #Rule(LinkExtractor(allow=(r'task.zhubajie.com/success/p\d+\.html',), callback='parse_item', follow=True),
    Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]')), callback='parse_item', follow=True)
]

def process_0(value):
    m = re.search('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20', value)
    if m:
        return m.strip('http://task.zhubajie.com/success/%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20')

1 Answer

User

#1 · Posted on 2024-09-29 21:46:03

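All of the links in the paginator contain a lot of whitespace: http://screencloud.net/v/qQLW. You should be able to preprocess each scraped value before getting results, using the following code: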

# coding: utf-8
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


def process_value(v):
    v1 = v.split()[-1]
    if v1.startswith('http'):
        v = v1
    return v


class MySpider(CrawlSpider):
    name = 'spider'
    start_urls = [
        'http://task.zhubajie.com/success/?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
    ]
    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="pagination"]'),
                           process_value=process_value), follow=True)
    ]


See the Scrapy docs for the process_value argument of LinkExtractor.
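As a quick check outside Scrapy, the sketch below runs the same process_value logic on a padded href. The raw_href string is an assumption: it is reconstructed from the 404 URL in the question's log (20 leading spaces, 16 trailing spaces), not copied from the page itself.

```python
def process_value(v):
    """Keep the last whitespace-separated token if it looks like an absolute URL."""
    v1 = v.split()[-1]
    if v1.startswith('http'):
        v = v1
    return v


# Hypothetical href, padded the way the pagination links appear to be
# according to the 404 URL in the question's log.
raw_href = (
    ' ' * 20
    + 'http://task.zhubajie.com/success/p2.html?kw=%E7%99%BE%E5%BA%A6%E7%9F%A5%E9%81%93'
    + ' ' * 16
)

cleaned = process_value(raw_href)
print(repr(cleaned))

# A value without surrounding whitespace that does not start with 'http'
# is returned unchanged.
unchanged = process_value('p3.html')
print(repr(unchanged))
```

Note that v.split()[-1] raises IndexError on an empty or whitespace-only value, so a guard may be worth adding if the page can contain such hrefs.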
