<h2>You don't need Selenium</h2>
<p>Selenium should never be your primary tool for scraping data from the web. It is slow and usually requires far more lines of code than the alternatives. Whenever possible, use <code>requests</code> combined with the <code>lxml</code> parser. In this particular use case you only use <code>selenium</code> to switch between URLs, something that can easily be hard-coded, removing the need for it in the first place:</p>
<pre><code>import requests
from lxml import html
import csv
import re
from datetime import datetime
import json


class GameCrawler(object):

    def __init__(self):
        self.input_date = input('Specify a date e.g. 2021/07/28: ')
        self.date_object = datetime.strptime(self.input_date, "%Y/%m/%d")
        self.output_file = '{}.csv'.format(re.sub('/', '-', self.input_date))
        self.ROOT_URL = 'https://int.soccerway.com'
        self.json_request_url = '{}/a/block_competition_matches_summary'.format(self.ROOT_URL)
        self.entry_point = '{}/matches/{}'.format(self.ROOT_URL, self.input_date)
        self.session = requests.Session()
        self.HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        self.all_game_urls = []
        self.league_urls = self.get_league_urls()

    def save_to_csv(self):
        with open(self.output_file, 'a+') as f:
            writer = csv.writer(f)
            for row in self.all_game_urls:
                writer.writerow([row])

    def request_other_pages(self, page_params):
        params = {
            'block_id': 'page_competition_1_block_competition_matches_summary_11',
            'callback_params': json.dumps({
                "page": page_params['page_count'] + 2,
                "block_service_id": "competition_summary_block_competitionmatchessummary",
                "round_id": int(page_params['round_id']),
                "outgroup": "",
                "view": 1,
                "competition_id": int(page_params['competition_id'])
            }),
            'action': 'changePage',
            'params': json.dumps({"page": page_params['page_count']}),
        }
        response = self.session.get(self.json_request_url, headers=self.HEADERS, params=params)
        if response.status_code != 200:
            return None
        json_data = json.loads(response.text)["commands"][0]["parameters"]["content"]
        return html.fromstring(json_data)

    def get_page_params(self, tree, response):
        res = re.search(r'r(\d+)?/$', response.url)
        if res:
            page_params = {
                'round_id': res.group(1),
                'competition_id': tree.xpath('//*[@data-competition]/@data-competition')[0],
                'page_count': len(tree.xpath('//*[@class="page-dropdown"]/option'))
            }
            return page_params if page_params['page_count'] != 0 else {}
        return {}

    def match_day_check(self, game):
        timestamp = game.xpath('./@data-timestamp')[0]
        match_date = datetime.fromtimestamp(int(timestamp))
        return match_date.date() == self.date_object.date()

    def scrape_page(self, tree):
        for game in tree.xpath('//*[@data-timestamp]'):
            game_url = game.xpath('./td[@class="score-time "]/a/@href')
            if game_url and self.match_day_check(game):
                self.all_game_urls.append('{}{}'.format(self.ROOT_URL, game_url[0]))

    def get_league_urls(self):
        page = self.session.get(self.entry_point, headers=self.HEADERS)
        tree = html.fromstring(page.content)
        return ['{}{}'.format(self.ROOT_URL, league_url)
                for league_url in tree.xpath('//th[@class="competition-link"]/a/@href')]

    def main(self):
        for index, league_url in enumerate(self.league_urls):
            response = self.session.get(league_url, headers=self.HEADERS)
            tree = html.fromstring(response.content)
            self.scrape_page(tree)
            page_params = self.get_page_params(tree, response)
            if page_params.get('page_count', 0) != 0:
                while True:
                    page_params['page_count'] -= 1
                    if page_params['page_count'] == 0:
                        break
                    tree = self.request_other_pages(page_params)
                    if tree is None:
                        continue
                    self.scrape_page(tree)
            print('Retrieved links for {} out of {} competitions'.format(index + 1, len(self.league_urls)))
        self.save_to_csv()


if __name__ == '__main__':
    GameCrawler().main()
</code></pre>
<h2>So when is Selenium worth using?</h2>
<p>Nowadays, websites often serve dynamic content, so if the data you want to retrieve is not loaded statically:</p>
<ol>
<li>check your browser's "Network" tab for requests
specific to the data you are interested in, and</li>
<li>try to mimic them with <code>requests</code>.</li>
</ol>
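<p>Point 2 usually amounts to rebuilding the query string the browser sends. As a sketch (reusing the parameter names from the crawler above; the numeric values here are made up), the stdlib's <code>urlencode</code> shows what <code>requests</code> does with a <code>params=</code> dict:</p>

```python
import json
from urllib.parse import urlencode

# Rebuild the pagination XHR's parameters by hand. The keys mirror the
# ones sent by request_other_pages() above; round_id/competition_id
# values are placeholders, not real identifiers.
params = {
    'block_id': 'page_competition_1_block_competition_matches_summary_11',
    'callback_params': json.dumps({"page": 3, "round_id": 63000,
                                   "competition_id": 8}),
    'action': 'changePage',
    'params': json.dumps({"page": 1}),
}

# requests performs this exact encoding when you pass params=params,
# i.e. session.get(json_request_url, params=params).
query = urlencode(params)
print('action=changePage' in query)  # → True
```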
<p>If points 1 and 2 are impossible because of how the web page is designed, then your best option is <code>selenium</code>, which will fetch the required content by simulating user interaction. For the HTML parsing you can still choose <code>lxml</code>, or you can stick with <code>selenium</code>, which offers that functionality as well.</p>
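<p>A minimal sketch of that combination: with Selenium you would grab <code>driver.page_source</code> once the page has rendered and hand it to <code>lxml</code>; here a static string stands in for the rendered page (markup modeled on the <code>score-time</code> cells above), so the snippet runs without a browser:</p>

```python
from lxml import html

# Stand-in for selenium's driver.page_source after rendering. With a
# real browser you would instead do:
#   driver = webdriver.Chrome(); driver.get(url)
#   page_source = driver.page_source
page_source = """
<table>
  <tr data-timestamp="1627425000">
    <td class="score-time "><a href="/matches/some-game/">1 - 0</a></td>
  </tr>
</table>
"""

# The same XPath the crawler above uses works on the rendered source.
tree = html.fromstring(page_source)
links = tree.xpath('//td[@class="score-time "]/a/@href')
print(links)  # → ['/matches/some-game/']
```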
<p>First edit:</p>
<ul>
<li>Fixed the issues raised by the OP</li>
<li>Noted the limitations of the provided code</li>
<li>Refactored the code</li>
<li>Added a date check to make sure only matches played on the specified date are saved</li>
<li>Added the ability to save the search results</li>
</ul>
<p>Second edit:</p>
<ul>
<li>Added the ability to walk through all pages of each listed competition using <code>get_page_params()</code> and <code>request_other_pages()</code></li>
<li>More code refactoring</li>
</ul>