Scrapy: url.startswith('http://bongdaplus.vn/tintuc')


I'm using Scrapy 1.0.5 on Windows 10 x64, with Python 2.7.11 and Anaconda 2.5.0 (32-bit).

This is the result I expect, a links.json file:

[
   "http://bongdaplus.vn/tin-tuc/ngoi-sao/tin-ben-le/chelsea-giam-gia-ao-pato-de-xa-hang-1494361604.html",
   "http://bongdaplus.vn/tin-tuc/ngoi-sao/wag/bo-sergio-ramos-khong-thich-cuoi-chui-1494221604.html",
   "http://bongdaplus.vn/tin-tuc/ngoi-sao/tin-ben-le/suarez-se-lam-ca-sy-neu-khong-da-bong-1494231604.html",
   "http://bongdaplus.vn/tin-tuc/ngoi-sao/tin-ben-le/10-chuyen-that-nhu-dua-ngay-ca-thang-tu-1179421604.html",
   "http://bongdaplus.vn/tin-tuc/viet-nam/tin-khac/the-he-vang-viet-nam-do-khai-van-vuong-nguoi-tinh-bong-da-1358961604.html",
   "http://bongdaplus.vn/tin-tuc/duc/bundesliga/bundesliga-cang-thang-cuoc-dua-du-champions-league-1492811604.html",
   "http://bongdaplus.vn/tin-tuc/the-gioi/nam-my/argentina/nhan-dinh-bong-da-olimpo-vs-rosario-central-07h15-ngay-2-4-1492801604.html",
   "http://bongdaplus.vn/tin-tuc/the-gioi/nhan-dinh-bong-da-new-england-revolution-vs-new-york-red-bulls-06h00-ngay-2-4-1492791604.html",
   "http://bongdaplus.vn/tin-tuc/anh/hang-nhat-anh/nhan-dinh-bong-da-qpr-vs-m-brough-01h45-ngay-2-4-1492781604.html",
   "http://bongdaplus.vn/tin-tuc/ngoi-sao/tin-ben-le/pique-phan-ung-ra-sao-khi-nghe-ca-khuc-truyen-thong-cua-real-1493481604.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/ibrahimovic-muon-nguoi-paris-tac-tuong-minh-thay-thap-eiffel-1476271603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/ibrahimovic-vuot-cot-moc-100-ban-trong-ngay-psg-vo-dich-ligue-i-1475911603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/li-do-psg-qua-vuot-troi-o-ligue-1-1475781603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/psg-vo-dich-ligue-i-voi-hang-loat-ky-luc-da-va-dang-duoc-tao-lap-1475771603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/psg-vua-cua-cac-vi-vua-1476291603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/ibrahimovic-vuot-cot-moc-100-ban-trong-ngay-psg-vo-dich-ligue-i-1475911603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/li-do-psg-qua-vuot-troi-o-ligue-1-1475781603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/psg-vo-dich-ligue-i-voi-hang-loat-ky-luc-da-va-dang-duoc-tao-lap-1475771603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/vui-dap-troyes-9-0-psg-len-ngoi-vo-dich-ligue-i-som-8-vong-dau-1475901603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-rennes-vs-lyon-03h00-ngay-14-3-quyet-chien-vi-champions-league-1475501603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-nantes-vs-angers-23h00-ngay-13-3-1475411603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/marseille-bi-cam-hoa-nice-ap-sat-sat-nhom-dan-dau-1475221603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-troyes-vs-psg-20h00-ngay-13-3-cho-tiec-vo-dich-1475131603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/psg-co-the-vo-dich-ligue-i-ngay-dem-nay-1475091603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-bastia-vs-lille-02h00-ngay-13-3-1474551603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-toulouse-vs-bordeaux-02h00-ngay-13-3-1474561603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-ajaccio-vs-caen-02h00-ngay-13-3-1474521603.html",
   "http://bongdaplus.vn/tin-tuc/phap/ligue-1/nhan-dinh-bong-da-guingamp-st-etienne-02h00-ngay-13-3-1474501603.html"
]

I created the file bongdaplusdotvn.py:

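(The code block itself was lost in archiving, but the deprecation warnings and the traceback below quote its key lines, so the spider must have looked roughly like this; the name, the start_urls value, and the print statement are reconstructions, not the original source:)

from scrapy.spider import BaseSpider           # deprecated path, per the warning below
from scrapy.selector import HtmlXPathSelector  # deprecated selector, per the warning below
from scrapy.http import Request


class MySpider(BaseSpider):                    # deprecated base class, per the warning below
    name = 'bongdaplusdotvn'
    start_urls = ['http://bongdaplus.vn']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)                # quoted in the warnings
        for url in hxs.select('//a/@href').extract():    # quoted in the warnings
            print url                                    # the log prints each href below
            yield Request(url, callback=self.parse)      # quoted in the traceback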

And here is the error output:

C:\Users\Administrator\Desktop>scrapy runspider bongdaplusdotvn.py
2016-04-07 14:45:33 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-04-07 14:45:33 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-04-07 14:45:33 [scrapy] INFO: Overridden settings: {}
2016-04-07 14:45:33 [py.warnings] WARNING: C:\Users\Administrator\Desktop\bongdaplusdotvn.py:2: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider

2016-04-07 14:45:33 [py.warnings] WARNING: C:\Users\Administrator\Desktop\bongdaplusdotvn.py:8: ScrapyDeprecationWarning: bongdaplusdotvn.MySpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class MySpider(BaseSpider):

2016-04-07 14:45:34 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-07 14:45:34 [boto] DEBUG: Retrieving credentials from metadata server.
2016-04-07 14:45:34 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\Administrator\Anaconda2\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\Administrator\Anaconda2\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\Administrator\Anaconda2\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\Administrator\Anaconda2\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\Administrator\Anaconda2\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10051] A socket operation was attempted to an unreachable network>
2016-04-07 14:45:34 [boto] ERROR: Unable to read instance data, giving up
2016-04-07 14:45:34 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-07 14:45:34 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-07 14:45:34 [scrapy] INFO: Enabled item pipelines:
2016-04-07 14:45:34 [scrapy] INFO: Spider opened
2016-04-07 14:45:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-07 14:45:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-04-07 14:45:34 [scrapy] DEBUG: Crawled (200) <GET http://bongdaplus.vn> (referer: None)
2016-04-07 14:45:34 [py.warnings] WARNING: C:\Users\Administrator\Desktop\bongdaplusdotvn.py:18: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
  hxs = HtmlXPathSelector(response)

2016-04-07 14:45:34 [py.warnings] WARNING: C:\Users\Administrator\Desktop\bongdaplusdotvn.py:19: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  for url in hxs.select('//a/@href').extract():

2016-04-07 14:45:34 [py.warnings] WARNING: C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\selector\unified.py:108: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, instantiate scrapy.Selector instead.
  for x in result]

http://bongdaplus.vn
http://livescore.bongdaplus.vn
http://video.bongdaplus.vn
/tin-tuc/ban-doc-viet/
2016-04-07 14:45:34 [scrapy] ERROR: Spider error processing <GET http://bongdaplus.vn> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
    for x in result:
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Administrator\Desktop\bongdaplusdotvn.py", line 29, in parse
    yield Request(url, callback=self.parse)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 24, in __init__
    self._set_url(url)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /tin-tuc/ban-doc-viet/
2016-04-07 14:45:35 [scrapy] DEBUG: Crawled (200) <GET http://bongdaplus.vn> (referer: http://bongdaplus.vn)
http://bongdaplus.vn
2016-04-07 14:45:35 [scrapy] DEBUG: Filtered duplicate request: <GET http://bongdaplus.vn> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
http://livescore.bongdaplus.vn
http://video.bongdaplus.vn
/tin-tuc/ban-doc-viet/
2016-04-07 14:45:35 [scrapy] ERROR: Spider error processing <GET http://bongdaplus.vn> (referer: http://bongdaplus.vn)
Traceback (most recent call last):
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
    for x in result:
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Administrator\Desktop\bongdaplusdotvn.py", line 29, in parse
    yield Request(url, callback=self.parse)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 24, in __init__
    self._set_url(url)
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 55, in _set_url
    self._set_url(url.encode(self.encoding))
  File "C:\Users\Administrator\Anaconda2\lib\site-packages\scrapy\http\request\__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /tin-tuc/ban-doc-viet/
2016-04-07 14:45:35 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force
2016-04-07 14:45:35 [scrapy] INFO: Closing spider (shutdown)
2016-04-07 14:45:36 [scrapy] INFO: Received SIGINT twice, forcing unclean shutdown
2016-04-07 14:45:36 [scrapy] DEBUG: Retrying <GET http://video.bongdaplus.vn> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2016-04-07 14:45:36 [scrapy] DEBUG: Retrying <GET http://livescore.bongdaplus.vn> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>, <twisted.python.failure.Failure twisted.web.http._DataLoss: >]
2016-04-07 14:45:36 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 953,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 83664,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'dupefilter/filtered': 3,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2016, 4, 7, 7, 45, 36, 260000),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 4,
 'log_count/INFO': 9,
 'log_count/WARNING': 3,
 'request_depth_max': 2,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'spider_exceptions/ValueError': 2,
 'start_time': datetime.datetime(2016, 4, 7, 7, 45, 34, 686000)}
2016-04-07 14:45:36 [scrapy] INFO: Spider closed (shutdown)

Help me crawl the site and collect all the links that start with

http://bongdaplus.vn/tin-tuc/

1 Answer

Check each URL with startswith before yielding the request, something like:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for url in hxs.select('//a/@href').extract():
        # Only follow absolute links under the news section; relative hrefs
        # (which raised the ValueError) no longer reach Request().
        if url.startswith('http://bongdaplus.vn/tin-tuc'):
            yield Request(url, callback=self.parse)
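Note that relative links like /tin-tuc/ban-doc-viet/, which raised the ValueError, are silently skipped by this check. If you want those too, join each href against the page URL first; and yielding a dict per matching link lets Scrapy's feed exporter write out links.json. A sketch building on the answer above (the urljoin step and the dict item are additions of mine, not part of the original answer):

import urlparse  # Python 2 stdlib


def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for href in hxs.select('//a/@href').extract():
        # Turn relative links like /tin-tuc/ban-doc-viet/ into absolute URLs.
        url = urlparse.urljoin(response.url, href)
        if url.startswith('http://bongdaplus.vn/tin-tuc'):
            yield {'url': url}                       # collected by the feed exporter
            yield Request(url, callback=self.parse)  # keep crawling matching pages

Running it with scrapy runspider bongdaplusdotvn.py -o links.json then produces a JSON list of {"url": ...} items, close to (though not byte-identical with) the expected output above.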
