"Scrapy截取长字符串"

Published 2024-10-06 10:32:20


I want to extract text data from forum posts. Here is my working spider:

import scrapy
import csv #not used yet


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html#Q3512477',
    ]

    def parse(self, response):
        for i in range(4, 6):  # start, stop
            # Build the XPath for the i-th post body and for its anchor name
            xString = '//*[@id="questions"]/div[2]/div[' + str(i) + ']/div[2]/div[1]/table/tr/td/div/text()'
            xStringLink = '//*[@id="questions"]/div[2]/div[' + str(i) + ']/a/@name'

            scraped_info = {
                'forum post': response.xpath(xString).extract(),
                'link': response.xpath(xStringLink).extract()
            }
            yield scraped_info

However, this is the only output I get. As you can see, the forum posts are cut off :(

(...)
2017-11-19 15:52:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html#Q3512477> (referer: None)
2017-11-19 15:52:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html>
{'forum post': ['BR210, alle Modelle mit Hersteller-Schlüssel-Nr., Typ-Schlüssel-Nr., Stückzahlen CDI Common-Rail...'], 'link': ['Q3594587']}
2017-11-19 15:52:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.motor-talk.de/faq/mercedes-e-klasse-w210-q89.html>
{'forum post': ['Als Neujahrs-Gruß nachfolgend eine Aufstellung der Fahrgestell-Indent-Nummern, sortiert nach Prod...'], 'link': ['Q5160969']}
2017-11-19 15:52:30 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

The posts are actually much longer, but the strings are simply cut off. Does anyone know why?
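
Two things may be worth checking here; neither is a confirmed diagnosis. First, an XPath ending in `/text()` returns only the text nodes that are direct children of the matched `div`, so text inside nested tags is skipped. Second, both scraped strings end in `...`, which may mean the matched node holds only a preview and the full post sits in another element. The sketch below uses only the standard library, with hypothetical markup (`SAMPLE`) and hypothetical helper names (`direct_text`, `full_text`, `looks_truncated`), to illustrate the difference and to flag preview-like strings:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup (not taken from the real page): a post body whose
# text is interrupted by an inline child element.
SAMPLE = (
    "<div>BR210, alle Modelle <b>mit Hersteller-Schlüssel-Nr.</b>"
    ", Typ-Schlüssel-Nr.</div>"
)


def direct_text(element):
    """Return only the element's own text nodes, similar to XPath text()."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return [p for p in parts if p.strip()]


def full_text(element):
    """Return all descendant text joined, similar to XPath string(.)."""
    return "".join(element.itertext())


def looks_truncated(text):
    """Flag strings that end in an ellipsis, as the scraped items do."""
    return text.rstrip().endswith(("...", "…"))


div = ET.fromstring(SAMPLE)
print(direct_text(div))   # text inside <b> is missing
print(full_text(div))     # complete text, including nested tags
print(looks_truncated("Stückzahlen CDI Common-Rail..."))
```

In the spider, the analogous change would be selecting `//text()` under the post container (or wrapping the path in `string(...)`) and joining the fragments. If the joined result still ends in `...`, the ellipsis is probably part of the served HTML, and the full text would have to come from a different element or URL.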

