Why isn't my spider going to the next page?

Published 2024-10-06 10:29:11


My spider isn't crawling horizontally and I don't know why.

The parse_item function works fine on the first page. I checked the XPath for next_page in the scrapy shell and it is correct.

Could you check my code?

The site I want to crawl is this one.

import scrapy
import datetime
import socket

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose
from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['www.vivareal.com.br']
    start_urls = ['https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/']

    next_page = '//li[@class="pagination__item"][last()]'

    rules = (
        Rule(LinkExtractor(restrict_xpaths=next_page)),
        Rule(LinkExtractor(allow=r'/imovel/', 
                            deny=r'/imoveis-lancamento/'),
                            callback='parse_item'),
    )

    def parse_item(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('url', 'a/@href', )
        l.add_xpath('tipo', '//h1/text()',
                    MapCompose(lambda x: x.strip().split()[0]))
        l.add_xpath('valor', '//h3[@class="price__price-info js-price-sale"]/text()',
                    MapCompose(lambda x: x.strip().replace('R$ ', '').replace('.', ''), float))
        l.add_xpath('condominio', '//span[@class="price__list-value condominium js-condominium"]/text()',
                    MapCompose(lambda x: x.strip().replace('R$ ', '').replace('.', ''), float))
        l.add_xpath('endereco', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[0]))
        l.add_xpath('bairro', '//p[@class="title__address js-address"]/text()',
                    MapCompose(lambda x: x.split(' - ')[1].split(',')[0]))
        l.add_xpath('quartos', '//ul[@class="features"]/li[@title="Quartos"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('banheiros', '//ul[@class="features"]/li[@title="Banheiros"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('vagas', '//ul[@class="features"]/li[@title="Vagas"]/span/text()',
                    MapCompose(lambda x: x.strip(), int))
        l.add_xpath('area', '//ul[@class="features"]/li[@title="Área"]/span/text()',
                    MapCompose(lambda x: x.strip(), float))
        l.add_value('url', response.url)
        
        # Housekeeping fields
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        
        return l.load_item()

Update

While searching the logs, I found the following about the horizontal crawl:

2021-02-22 17:09:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 17:09:24 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

It seems the next page is being treated as a duplicate, but I don't know how to fix it.
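A likely explanation for the log lines above (my reading, not stated in the original post): Scrapy fingerprints requests on a canonicalized URL with the #fragment removed, so every "#pagina=N" link collapses to the same URL and the dupefilter drops all but the first. The stdlib shows the collapse:

```python
from urllib.parse import urldefrag

# Scrapy's request fingerprint is computed on a URL with the fragment
# stripped, so "#pagina=2", "#pagina=3", ... all look identical to the
# duplicate filter. urldefrag() demonstrates the same stripping:
base = 'https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/'
page2, _ = urldefrag(base + '#pagina=2')
page3, _ = urldefrag(base + '#pagina=3')
print(page2 == page3)  # True: both reduce to the same fragment-free URL
```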

Also, I realized that although the href points to #pagina=2, the actual URL is ?pagina=2.

Any hints?
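One possible fix (a sketch, not from the original post): rewrite the fragment into the query string the server actually accepts, using LinkExtractor's process_value hook. The helper name fix_pagination is hypothetical:

```python
def fix_pagination(url):
    # Hypothetical helper: the pagination href uses "#pagina=N" (a fragment,
    # handled client-side), but the server expects "?pagina=N". Rewriting the
    # fragment into a query string gives each page a distinct, fetchable URL.
    return url.replace('#pagina=', '?pagina=')

# It could then be wired into the pagination rule, e.g.:
# Rule(LinkExtractor(restrict_xpaths=next_page, process_value=fix_pagination))
```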


1 Answer

Actually, your spider isn't even crawling the first page.

The problem is in allowed_domains. Change it to

allowed_domains = ['www.vivareal.com.br']

and you will start crawling. After this change you will get many errors (as I can see here, the code raises exceptions due to logic errors), but your code will run as expected.

Edit (2)

Check the logs:

2021-02-22 13:36:19 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>

As stated here in this old question, basically, allowed_domains was not set correctly.
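For reference, a sketch of what allowed_domains expects (per Scrapy's documentation): bare domain names, with no scheme and no path.

```python
allowed_domains = ['vivareal.com.br']                # also matches subdomains such as www.
# allowed_domains = ['https://www.vivareal.com.br']  # wrong: scheme included
# allowed_domains = ['www.vivareal.com.br/venda/']   # wrong: path included
```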

Edit: To be clear, after running the spider as defined in the question, the log I get is:


2021-02-22 13:29:18 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:29:18 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (default, Feb  9 2020, 21:34:32) - [GCC 7.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:29:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:29:18 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties',
 'NEWSPIDER_MODULE': 'properties.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet Password: 3790c3525890efea
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:29:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:29:18 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:29:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-22 13:29:18 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:29:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:29:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:29:20 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.vivareal.com.br': <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2>
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Closing spider (finished)
2021-02-22 13:29:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 156997,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 1.87473,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 2, 22, 16, 29, 20, 372722),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'memusage/max': 54456320,
 'memusage/startup': 54456320,
 'offsite/domains': 1,
 'offsite/filtered': 34,
 'request_depth_max': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 2, 22, 16, 29, 18, 497992)}
2021-02-22 13:29:20 [scrapy.core.engine] INFO: Spider closed (finished)

When I run it with the suggested change, the log is as follows (edited so as not to show my paths):

2021-02-22 13:31:47 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: properties)
2021-02-22 13:31:47 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.1 (default, Feb  9 2020, 21:34:32) - [GCC 7.4.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Linux-4.15.0-135-generic-x86_64-with-glibc2.27
2021-02-22 13:31:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-02-22 13:31:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'properties',
 'NEWSPIDER_MODULE': 'properties.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['properties.spiders']}
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet Password: 65a5f31c8dda80fa
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-22 13:31:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-22 13:31:47 [scrapy.core.engine] INFO: Spider opened
2021-02-22 13:31:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-02-22 13:31:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/robots.txt> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/> (referer: None)
2021-02-22 13:31:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/#pagina=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-serra-bairros-belo-horizonte-com-garagem-246m2-venda-RS1950000-id-2510579983/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/casa-3-quartos-sao-geraldo-bairros-belo-horizonte-com-garagem-120m2-venda-RS460000-id-2484383176/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-4-quartos-savassi-bairros-belo-horizonte-com-garagem-206m2-venda-RS1790000-id-2503711314/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-2-quartos-paqueta-bairros-belo-horizonte-com-garagem-60m2-venda-RS260000-id-2479637684/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-savassi-bairros-belo-horizonte-com-garagem-107m2-venda-RS1250000-id-2506122689/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-1-quartos-funcionarios-bairros-belo-horizonte-com-garagem-41m2-venda-RS330000-id-2510414426/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spiders/crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/home/leomaffei/properties/properties/spiders/spider.py", line 28, in parse_item
    l.add_xpath('url', 'a/@href', )
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 350, in add_xpath
    self.add_value(field_name, values, *processors, **kw)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 190, in add_value
    self._add_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 208, in _add_value
    processed_value = self._process_input_value(field_name, value)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 312, in _process_input_value
    proc = self.get_input_processor(field_name)
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 290, in get_input_processor
    proc = self._get_item_field_attr(
  File "/usr/lib/python3.8/site-packages/itemloaders/__init__.py", line 308, in _get_item_field_attr
    field_meta = ItemAdapter(self.item).get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 235, in get_field_meta
    return self.adapter.get_field_meta(field_name)
  File "/usr/lib/python3.8/site-packages/itemadapter/adapter.py", line 161, in get_field_meta
    return MappingProxyType(self.item.fields[field_name])
KeyError: 'url'
2021-02-22 13:31:50 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.vivareal.com.br/imovel/apartamento-3-quartos-nova-granada-bairros-belo-horizonte-com-garagem-74m2-venda-RS499000-id-2509923918/> (referer: https://www.vivareal.com.br/venda/minas-gerais/belo-horizonte/)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/scrapy/utils/defer.py", line 120, in iter_errback
    yield next(it)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/utils/python.py", line 353, in __next__
    return next(self.data)
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/referer.py", line 340, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 62, in _evaluate_iterable
    for r in iterable:
  File "/usr/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filte
...
