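<p>The site uses protection based on cookies and the user agent. You can verify this as follows: open DevTools in Chrome, navigate to the target page <a href="http://www.dwarozh.net/sport/" rel="nofollow noreferrer">http://www.dwarozh.net/sport/</a>, then in the Network tab right-click the request for the page and choose "Copy as cURL".
Open a terminal and run the curl command:</p>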
<pre><code>$ curl 'http://www.dwarozh.net/sport/all-hawal.aspx?cor=3&Nawnishan=%D9%88%DB%95%D8%B1%D8%B2%D8%B4%DB%95%DA%A9%D8%A7%D9%86%DB%8C%20%D8%AF%DB%8C%DA%A9%DB%95' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,es;q=0.2' -H 'Upgrade-Insecure-Requests: 1' -H 'X-Compress: null' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.dwarozh.net/sport/details.aspx?jimare=10505' -H 'Cookie: __cfduid=dc9867; sucuri_cloudproxy_uuid_ce28bca9c=d36ad9; ASP.NET_SessionId=wqdo0v; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c=6ab0; _gat=1; __asc=7c0b5; __auc=35; _ga=GA1.2.19688' -H 'Connection: keep-alive' --compressed
</code></pre>
<p>You will see the normal HTML. If you remove the cookies or the user agent from the request, you get the captcha page instead.</p>
<p>You can check this by re-running the command without the <code>Cookie</code> or <code>User-Agent</code> header.</p>
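<p>The same comparison can be sketched in Python with only the standard library. This is a rough illustration, not part of the original check: the helper names <code>header_variants</code> and <code>fetch_status_and_size</code> are made up here, and the cookie and user-agent values are whatever you copied from DevTools.</p>

```python
import urllib.request


def header_variants(cookie_header, user_agent):
    """Return the header sets to compare: full, without cookies, without the user agent."""
    full = {"Cookie": cookie_header, "User-Agent": user_agent}
    return {
        "full": full,
        "no_cookie": {k: v for k, v in full.items() if k != "Cookie"},
        # Note: urllib then sends its own default "Python-urllib/x.y" user agent.
        "no_user_agent": {k: v for k, v in full.items() if k != "User-Agent"},
    }


def fetch_status_and_size(url, headers, timeout=10):
    """Send one GET request and return (HTTP status, body length in bytes).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, len(response.read())
```

<p>Calling <code>fetch_status_and_size</code> once per entry of <code>header_variants(...)</code> and comparing the results should show whether the variants without the cookie or the user agent receive a different page.</p>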
<p>Great! Let's write a spider:</p>
<p>I modified yours, because I don't have the source code of some of its components.</p>
<pre><code>from scrapy import Spider, Request
#from Stack.items import StackItem
#from bs4 import BeautifulSoup

# `from scrapy import log` was dropped: scrapy.log is deprecated and unused here.
# The run below still shows the deprecation warning it triggered.


class StackSpider(Spider):
    name = "dwarozh"
    start_urls = [
        "http://www.dwarozh.net/sport/",
    ]
    # Cookie header and user agent as copied from the browser.
    _cookie_str = '''__cfduid=dc986; sucuri_cloudproxy_uuid_ce=d36a; ASP.NET_SessionId=wq; __atuvc=1%7C49; sucuri_cloudproxy_uuid_0d5c97a96=6a; _gat=1; __asc=7c0b; __auc=3; _ga=GA1.2.196.14'''
    _user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/5 (KHTML, like Gecko) Chrome/54 Safari/5'

    def start_requests(self):
        # Split on the first '=' only, so values that contain '=' stay intact.
        cookies = dict(pair.split('=', 1) for pair in self._cookie_str.split('; '))
        return [Request(url=url, cookies=cookies, headers={'User-Agent': self._user_agent})
                for url in self.start_urls]

    def parse(self, response):
        mItems = response.xpath('//div[@class="news-more-img"]/ul/li')
        for mItem in mItems:
            item = {}  # StackItem()
            item['title'] = mItem.xpath('a/h2/text()').extract_first()
            # The selector used for the run below was 'viewa/@href', which matches
            # nothing (hence 'url': None there); 'a/@href' is presumably what was meant.
            item['url'] = mItem.xpath('a/@href').extract_first()
            yield {'url': item['url'], 'title': item['title']}
</code></pre>
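<p>The cookie handling in <code>start_requests</code> can also be pulled out into a small helper. The following is only a sketch; <code>parse_cookie_header</code> is a hypothetical name, not a Scrapy function. It tolerates stray whitespace, empty fragments, and values that contain <code>=</code>.</p>

```python
def parse_cookie_header(cookie_str):
    """Turn a 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in cookie_str.split(';'):
        pair = pair.strip()
        if not pair:
            continue  # skip empty fragments such as a trailing ';'
        name, sep, value = pair.partition('=')
        if not sep:
            continue  # skip fragments that have no '=' at all
        cookies[name.strip()] = value.strip()
    return cookies
```

<p>The resulting dict can be passed as the <code>cookies</code> argument of <code>Request</code>, just like the dict built in the spider.</p>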
<p>Let's run it:</p>
<pre><code>$ scrapy crawl dwarozh -o - -t csv --loglevel=DEBUG
/Users/el/Projects/scrap_woman/.env/lib/python3.4/importlib/_bootstrap.py:321: ScrapyDeprecationWarning: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the documentation to learn more.
return f(*args, **kwds)
2016-12-10 00:18:55 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrap1)
2016-12-10 00:18:55 [scrapy] INFO: Overridden settings: {'SPIDER_MODULES': ['scrap1.spiders'], 'FEED_FORMAT': 'csv', 'BOT_NAME': 'scrap1', 'FEED_URI': 'stdout:', 'NEWSPIDER_MODULE': 'scrap1.spiders', 'ROBOTSTXT_OBEY': True}
2016-12-10 00:18:55 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-10 00:18:55 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-10 00:18:55 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-10 00:18:55 [scrapy] INFO: Spider opened
2016-12-10 00:18:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-10 00:18:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2016-12-10 00:18:55 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/robots.txt> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Crawled (200) <GET http://www.dwarozh.net/sport/> (referer: None)
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nلیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nهەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nگرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nبەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە'}
2016-12-10 00:18:56 [scrapy] DEBUG: Scraped from <200 http://www.dwarozh.net/sport/>
{'url': None, 'title': '\nكچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە'}
2016-12-10 00:18:56 [scrapy] INFO: Closing spider (finished)
2016-12-10 00:18:56 [scrapy] INFO: Stored csv feed (5 items) in: stdout:
2016-12-10 00:18:56 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 950,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 15121,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 9, 21, 18, 56, 271371),
'item_scraped_count': 5,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 12, 9, 21, 18, 55, 869851)}
2016-12-10 00:18:56 [scrapy] INFO: Spider closed (finished)
url,title
,"
لیستی یاریزانانی ریاڵ مەدرید بۆ یاری سبەی ڕاگەیەنراو پێنج یاریزان دورخرانەوە"
,"
هەواڵێکی ناخۆش بۆ هاندەرانی ریاڵ مەدرید"
,"
گرنگترین مانشێتی ئەمرۆ هەینی رۆژنامەکانی ئیسپانیا"
,"
بەفەرمی یۆفا پێكهاتەی نموونەی جەولەی شەشەم و کۆتایی چامپیۆنس لیگی بڵاو کردەوە"
,"
كچە یاریزانێك دەبێتە هۆیی دروست بوونی تیپێكی تۆكمە"
</code></pre>
<p>You may need to refresh the cookies from time to time. You can use PhantomJS for that.</p>
<p><strong>Update</strong>:</p>
<p>How to get the cookies with PhantomJS.</p>
<ol>
<li><p>Install <a href="http://phantomjs.org/quick-start.html" rel="nofollow noreferrer">PhantomJS</a>.</p></li>
<li><p>Write a script like this one, <code>dwarosh.js</code>:</p>
<pre><code>var page = require('webpage').create();
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.dwarozh.net/sport/', function(status) {
    console.log("Status: " + status);
    if (status === "success") {
        page.render('example.png');
        page.evaluate(function() {
            return document.title;
        });
    }
    for (var i = 0; i < page.cookies.length; i++) {
        var c = page.cookies[i];
        console.log(c.name, c.value);
    }
    phantom.exit();
});
</code></pre></li>
<li><p>Run the script:</p>
<pre><code>$ phantomjs --cookies-file=cookie.txt dwarosh.js
TypeError: undefined is not an object (evaluating 'activeElement.position().left')
http://www.dwarozh.net/sport/js/script.js:5
https://code.jquery.com/jquery-1.10.2.min.js:4 in c
https://code.jquery.com/jquery-1.10.2.min.js:4 in fireWith
https://code.jquery.com/jquery-1.10.2.min.js:4 in ready
https://code.jquery.com/jquery-1.10.2.min.js:4 in q
Status: success
__auc 250ab0a9158ee9e73eeeac78bba
__asc 250ab0a9158ee9e73eeeac78bba
_gat 1
_ga GA1.2.260482211.1481472111
ASP.NET_SessionId vs1utb1nyblqkxprxgazh0g2
sucuri_cloudproxy_uuid_3e07984e4 26e4ab3...
__cfduid d9059962a4c12e0f....1
</code></pre></li>
<li><p>Take the cookie <code>sucuri_cloudproxy_uuid_3e07984e4</code> and try to fetch the page with <code>curl</code>, using the same user agent:</p>
<pre><code>$ curl -v http://www.dwarozh.net/sport/ -b sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465 -A SpecialAgent
* Trying 104.25.209.23...
* Connected to www.dwarozh.net (104.25.209.23) port 80 (#0)
> GET /sport/ HTTP/1.1
> Host: www.dwarozh.net
> User-Agent: SpecialAgent
> Accept: */*
> Cookie: sucuri_cloudproxy_uuid_3e07984e4=26e4ab377efbf766d4be7eff20328465
>
< HTTP/1.1 200 OK
< Date: Sun, 11 Dec 2016 16:17:04 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=d1646515f5ba28212d4e4ca562e2966311481473024; expires=Mon, 11-Dec-17 16:17:04 GMT; path=/; domain=.dwarozh.net; HttpOnly
< Cache-Control: private
< Vary: Accept-Encoding
< Set-Cookie: ASP.NET_SessionId=srxyurlfpzxaxn1ufr0dvxc2; path=/; HttpOnly
< X-AspNet-Version: 4.0.30319
< X-XSS-Protection: 1; mode=block
< X-Frame-Options: SAMEORIGIN
< X-Content-Type-Options: nosniff
< X-Sucuri-ID: 15008
< Server: cloudflare-nginx
< CF-RAY: 30fa3ea1335237b0-ARN
<
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><title>
Dwarozh : Sport
</title><meta content="دواڕۆژ سپۆرت هەواڵی ناوخۆ،هەواڵی جیهانی، وەرزشەکانی دیکە" name="description"/><meta property="fb:app_id" content="1713056075578566"/><meta content="initial-scale=1.0, width=device-width, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="wene/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="wene/style.css" rel="stylesheet" type="text/css"/>
<script src="js/jquery-2.1.1.js" type="text/javascript"></script>
<script src="https://code.jquery.com/jquery-1.10.2.min.js" type="text/javascript"></script>
<script src="js/script.js" type="text/javascript"></script>
<link href="css/styles.css" rel="stylesheet"/>
<script src="js/classie.js" type="text/javascript"></script>
<script type="text/javascript">
</code></pre></li>
</ol>
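<p>To feed the refreshed cookie back into the spider, the script output can be parsed in Python. This is a minimal sketch under a few assumptions: the script prints one <code>name value</code> pair per line as above, <code>phantomjs</code> is on the <code>PATH</code>, and the function names are invented for this example.</p>

```python
import subprocess

SUCURI_PREFIX = "sucuri_cloudproxy_uuid_"


def sucuri_cookies_from_output(output):
    """Pick the 'sucuri_cloudproxy_uuid_*' cookies out of the script's output."""
    cookies = {}
    for line in output.splitlines():
        parts = line.split()
        # Cookie lines look like "name value"; error and status lines are ignored.
        if len(parts) == 2 and parts[0].startswith(SUCURI_PREFIX):
            cookies[parts[0]] = parts[1]
    return cookies


def refresh_sucuri_cookies(script="dwarosh.js"):
    """Run the PhantomJS script (assumed to be in the current directory) and parse its output."""
    result = subprocess.run(["phantomjs", script], capture_output=True, text=True)
    return sucuri_cookies_from_output(result.stdout)
```

<p>The returned dict can be passed as <code>cookies</code> to the Scrapy <code>Request</code>, as long as the request also sends the user agent set in the script (<code>SpecialAgent</code> here).</p>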