How do I use requests.Session.get() with Scrapy? I tried it myself, but it still fails

Posted 2024-10-01 05:01:15


I want to scrape comments from YouTube using requests.Session.get(), but I get an error and I don't know whether the code I wrote is even correct.

I think the main problem is here:

response = session.get(self.YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id))
yield self.parse(response)

Also, can anyone give me a better example of using requests.Session.get() with Scrapy, so I can understand it?

The spider:

import scrapy
import time
import requests
import lxml.html
import io
from lxml.cssselect import CSSSelector

class CommentsSpider(scrapy.Spider):
    name = 'comments'
    allowed_domains = ['youtube.com']
    start_urls = ['https://www.youtube.com/watch?v=xHkL9PU7o9k']
    YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'

    def start_requests(self):
        session = requests.Session()
        for url in self.start_urls:
            youtube_id = url[32:]
            response = session.get(self.YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id))
            yield self.parse(response)

    def parse(self, response):
        html = response.text
        tree = lxml.html.fromstring(html)
        item_sel = CSSSelector('.comment-item')
        text_sel = CSSSelector('.comment-text-content')
        time_sel = CSSSelector('.time')
        author_sel = CSSSelector('.user-name')

        for item in item_sel(tree):
            yield {'cid': item.get('data-cid'),
                   'text': text_sel(item)[0].text_content(),
                   'time': time_sel(item)[0].text_content().strip(),
                   'author': author_sel(item)[0].text_content()}

Output

I don't understand why Scrapy raises this error:

2019-08-03 19:14:39 [urllib3.connectionpool] DEBUG: https://www.youtube.com:443 "GET /watch?v=xHkL9PU7o9k HTTP/1.1" 200 None
2019-08-03 19:14:40 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x04C6CC90>>
Traceback (most recent call last):
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\utils\signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'generator' object has no attribute 'meta'
Unhandled Error
Traceback (most recent call last):
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 1272, in run
    self.mainLoop()
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 1281, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 902, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'

2019-08-03 19:14:40 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
    self.crawler_process.start()
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\crawler.py", line 309, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 1272, in run
    self.mainLoop()
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 1281, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\twisted\internet\base.py", line 902, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "c:\users\shahzaib butt\appdata\local\programs\python\python37-32\lib\site-packages\scrapy\core\scheduler.py", line 91, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
builtins.AttributeError: 'generator' object has no attribute 'dont_filter'

2019-08-03 19:14:43 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-03 19:14:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 5.006301,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 8, 3, 14, 14, 43, 264882),
'log_count/CRITICAL': 1,
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'start_time': datetime.datetime(2019, 8, 3, 14, 14, 38, 258581)}
2019-08-03 19:14:43 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer
User
#1 · Posted 2024-10-01 05:01:15

This is what I tried:

import scrapy


class YoutubeComSpider(scrapy.Spider):
    name = 'youtube.com'
    allowed_domains = ['youtube.com']
    start_urls = ['https://www.youtube.com/watch?v=xHkL9PU7o9k']
    YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'

    def start_requests(self):
        for url in self.start_urls:
            youtube_id = url[32:]
            main_url = self.YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id)
            print(main_url)
            yield scrapy.Request(url=main_url, callback=self.parse)

    def parse(self, response):
        pass
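
The difference that matters: start_requests() must yield scrapy.Request objects. When your version yields self.parse(response), the engine receives a plain generator instead, and the scheduler fails as soon as it looks up request.dont_filter on it, which is exactly the AttributeError in your traceback.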

And this is all I got:

(base) F:\Projects>scrapy runspider youtube_com.py
2019-08-04 02:16:44 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-08-04 02:16:44 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.17763-SP0
2019-08-04 02:16:44 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2019-08-04 02:16:44 [scrapy.extensions.telnet] INFO: Telnet Password: 7173ce54ae5ff9bb
2019-08-04 02:16:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-08-04 02:16:45 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-08-04 02:16:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-08-04 02:16:45 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-08-04 02:16:45 [scrapy.core.engine] INFO: Spider opened
2019-08-04 02:16:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-04 02:16:45 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
https://www.youtube.com/all_comments?v=xHkL9PU7o9k
2019-08-04 02:16:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.youtube.com/watch?v=xHkL9PU7o9k> from <GET https://www.youtube.com/all_comments?v=xHkL9PU7o9k>
2019-08-04 02:16:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=xHkL9PU7o9k> (referer: None)
2019-08-04 02:16:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-08-04 02:16:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 555,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 51026,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 8, 3, 20, 46, 46, 449213),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 8, 3, 20, 46, 45, 94929)}
2019-08-04 02:16:46 [scrapy.core.engine] INFO: Spider closed (finished)
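
If you genuinely need requests.Session (for example, to log in or to collect cookies before the crawl), one workable pattern is to make the priming request with requests and then hand the actual download back to Scrapy, forwarding the session's cookies. The following is a minimal sketch under that assumption, not something your original task strictly requires; the cookie forwarding and the CSS selectors (mirroring your lxml version) are illustrative:

import requests
import scrapy


class CommentsSpider(scrapy.Spider):
    name = 'comments'
    allowed_domains = ['youtube.com']
    start_urls = ['https://www.youtube.com/watch?v=xHkL9PU7o9k']
    YOUTUBE_COMMENTS_URL = 'https://www.youtube.com/all_comments?v={youtube_id}'

    def start_requests(self):
        session = requests.Session()
        for url in self.start_urls:
            youtube_id = url[32:]
            comments_url = self.YOUTUBE_COMMENTS_URL.format(youtube_id=youtube_id)
            # Do any session set-up with requests here (login, cookies, ...).
            session.get(url)
            # Hand the real download to Scrapy: yield a scrapy.Request
            # (never the result of calling self.parse()), passing the
            # session cookies along so Scrapy sees the same state.
            yield scrapy.Request(url=comments_url,
                                 cookies=session.cookies.get_dict(),
                                 callback=self.parse)

    def parse(self, response):
        # A normal Scrapy response, so the built-in selectors work.
        for item in response.css('.comment-item'):
            yield {'cid': item.attrib.get('data-cid'),
                   'text': item.css('.comment-text-content::text').get(),
                   'time': item.css('.time::text').get(default='').strip(),
                   'author': item.css('.user-name::text').get()}

One caveat the log above already shows: https://www.youtube.com/all_comments gets 301-redirected straight back to the watch page, so even a correctly wired spider may find no .comment-item nodes in the HTML it receives.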
