<h2>You don't need Selenium</h2>
<p>Selenium should never be your primary tool for scraping data from the web. It is slow and usually requires far more lines of code than the alternatives. Whenever possible, use <code>requests</code> combined with the <code>lxml</code> parser. In this particular use case you only use <code>selenium</code> to switch between URLs, something that can easily be hard-coded, removing the need for it in the first place:</p>
<pre><code>import requests
from lxml import html
import csv
import re
from datetime import datetime
import json


class GameCrawler(object):

    def __init__(self):
        self.input_date = input('Specify a date e.g. 2021/07/28: ')
        self.date_object = datetime.strptime(self.input_date, "%Y/%m/%d")
        self.output_file = '{}.csv'.format(re.sub('/', '-', self.input_date))
        self.ROOT_URL = 'https://int.soccerway.com'
        self.json_request_url = '{}/a/block_competition_matches_summary'.format(self.ROOT_URL)
        self.entry_point = '{}/matches/{}'.format(self.ROOT_URL, self.input_date)
        self.session = requests.Session()
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        self.all_game_urls = []
        self.league_urls = self.get_league_urls()

    def save_to_csv(self):
        with open(self.output_file, 'a+') as f:
            writer = csv.writer(f)
            for row in self.all_game_urls:
                writer.writerow([row])

    def request_other_pages(self, page_params):
        params = {
            'block_id': 'page_competition_1_block_competition_matches_summary_11',
            'callback_params': json.dumps({
                "page": page_params['page_count'] + 2,
                "block_service_id": "competition_summary_block_competitionmatchessummary",
                "round_id": int(page_params['round_id']),
                "outgroup": "",
                "view": 1,
                "competition_id": int(page_params['competition_id'])
            }),
            'action': 'changePage',
            'params': json.dumps({"page": page_params['page_count']}),
        }
        response = self.session.get(self.json_request_url, headers=self.HEADERS, params=params)
        if response.status_code != 200:
            return None
        json_data = json.loads(response.text)["commands"][0]["parameters"]["content"]
        return html.fromstring(json_data)

    def get_page_params(self, tree, response):
        res = re.search(r'r(\d+)?/$', response.url)
        if res:
            page_params = {
                'round_id': res.group(1),
                'competition_id': tree.xpath('//*[@data-competition]/@data-competition')[0],
                'page_count': len(tree.xpath('//*[@class="page-dropdown"]/option'))
            }
            return page_params if page_params['page_count'] != 0 else {}
        return {}

    def match_day_check(self, game):
        timestamp = game.xpath('./@data-timestamp')[0]
        match_date = datetime.fromtimestamp(int(timestamp))
        return match_date.date() == self.date_object.date()

    def scrape_page(self, tree):
        for game in tree.xpath('//*[@data-timestamp]'):
            game_url = game.xpath('./td[@class="score-time "]/a/@href')
            if game_url and self.match_day_check(game):
                self.all_game_urls.append('{}{}'.format(self.ROOT_URL, game_url[0]))

    def get_league_urls(self):
        page = self.session.get(self.entry_point, headers=self.HEADERS)
        tree = html.fromstring(page.content)
        return ['{}{}'.format(self.ROOT_URL, league_url)
                for league_url in tree.xpath('//th[@class="competition-link"]/a/@href')]

    def main(self):
        for index, league_url in enumerate(self.league_urls):
            response = self.session.get(league_url, headers=self.HEADERS)
            tree = html.fromstring(response.content)
            self.scrape_page(tree)
            page_params = self.get_page_params(tree, response)
            if page_params.get('page_count', 0) != 0:
                while True:
                    page_params['page_count'] -= 1
                    if page_params['page_count'] == 0:
                        break
                    tree = self.request_other_pages(page_params)
                    if tree is None:
                        continue
                    self.scrape_page(tree)
            print('Retrieved links for {} out of {} competitions'.format(index + 1, len(self.league_urls)))
        self.save_to_csv()


if __name__ == '__main__':
    GameCrawler().main()
</code></pre>
<h2>So when is Selenium worth using?</h2>
<p>Nowadays, websites often serve dynamic content, so if the data you want to retrieve is not loaded statically:</p>
<ol>
<li>check your browser's "Network" tab for requests
specific to the data you are interested in, and</li>
<li>try to mimic them with <code>requests</code>.</li>
</ol>
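<p>Point 2 usually amounts to rebuilding the query string the browser sends. As a sketch (reusing the parameter names from the crawler above; the numeric values here are made up), the stdlib's <code>urlencode</code> shows what <code>requests</code> does with a <code>params=</code> dict:</p>

```python
import json
from urllib.parse import urlencode

# Rebuild the pagination XHR's parameters by hand. The keys mirror the
# ones sent by request_other_pages() above; round_id/competition_id
# values are placeholders, not real identifiers.
params = {
    'block_id': 'page_competition_1_block_competition_matches_summary_11',
    'callback_params': json.dumps({"page": 3, "round_id": 63000,
                                   "competition_id": 8}),
    'action': 'changePage',
    'params': json.dumps({"page": 1}),
}

# requests performs this exact encoding when you pass params=params,
# i.e. session.get(json_request_url, params=params).
query = urlencode(params)
print('action=changePage' in query)  # → True
```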
<p>If points 1 and 2 are impossible because of how the web page is designed, then your best option is <code>selenium</code>, which will fetch the required content by simulating user interaction. For the HTML parsing you can still choose <code>lxml</code>, or you can stick with <code>selenium</code>, which offers that functionality as well.</p>
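<p>A minimal sketch of that combination: with Selenium you would grab <code>driver.page_source</code> once the page has rendered and hand it to <code>lxml</code>; here a static string stands in for the rendered page (markup modeled on the <code>score-time</code> cells above), so the snippet runs without a browser:</p>

```python
from lxml import html

# Stand-in for selenium's driver.page_source after rendering. With a
# real browser you would instead do:
#   driver = webdriver.Chrome(); driver.get(url)
#   page_source = driver.page_source
page_source = """
<table>
  <tr data-timestamp="1627425000">
    <td class="score-time "><a href="/matches/some-game/">1 - 0</a></td>
  </tr>
</table>
"""

# The same XPath the crawler above uses works on the rendered source.
tree = html.fromstring(page_source)
links = tree.xpath('//td[@class="score-time "]/a/@href')
print(links)  # → ['/matches/some-game/']
```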
<p>First edit:</p>
<ul>
<li>Fixed the issues raised by the OP</li>
<li>Noted the limitations of the provided code</li>
<li>Refactored the code</li>
<li>Added a date check to make sure only matches played on the specified date are saved</li>
<li>Added the ability to save the search results</li>
</ul>
<p>Second edit:</p>
<ul>
<li>Added the ability to walk through all pages of each listed competition using <code>get_page_params()</code> and <code>request_other_pages()</code></li>
<li>More code refactoring</li>
</ul>