How do I extract the title and src of images using BeautifulSoup or Selenium?

Posted 2024-10-01 22:32:50

So, I have the whole page content in:

content = driver.page_source
soup = BeautifulSoup(content, features="html.parser")

Then I did this:

idioma = soup.select(".idioma > span:nth-child(1)")

which gave me this:

[<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>]

When I do this to get the titles:

idioma = [''.join(elem.find('img')['title']) for elem in idioma if elem]

I only get the first one:

['Idioma Aleman']

Why can't I get all of them?


2 Answers

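To extract the title and src attributes from all the <img> elements inside the <span> using Selenium, you need to use WebDriverWait with visibility_of_all_elements_located(), and you can use either of the following locator strategies: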

  • Titles, using CSS_SELECTOR:

    print([my_elem.get_attribute("title") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".idioma > span:nth-child(1) img.post_flagen[alt^='Idioma']")))])
    
  • src, using XPATH:

    print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@class, 'idioma')]//span//img[starts-with(@alt, 'Idioma') and @class='post_flagen']")))])
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

Why can't you get all the titles?

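This is because idioma contains only one element (the single <span>), and find(), which you are using, returns only the first match inside it.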

You can do this instead:

# select() returns a list of tags, so loop over it and call find_all() on each <span>
idioma = [img['title'] for elem in idioma for img in elem.find_all('img')]
print(idioma)

Output:

['Idioma Aleman', 'Idioma Chino-tradicional', 'Idioma Coreano', 'Idioma Español', 'Idioma Español-latino', 'Idioma Frances', 'Idioma Ingles', 'Idioma Italiano', 'Idioma Portugues', 'Idioma Ruso']
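
The difference between find() and find_all() can be seen side by side in a minimal sketch. The two-image HTML snippet, its relative src values, and the variable names below are made up for illustration and only stand in for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical two-image snippet standing in for the real page.
html = """
<div class="idioma"><span>
<img alt="Idioma Aleman" src="ale.png" title="Idioma Aleman"/>
<img alt="Idioma Ruso" src="rus.png" title="Idioma Ruso"/>
</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list; here it holds exactly one <span>.
spans = soup.select(".idioma > span:nth-child(1)")

# find() returns only the first matching descendant of each <span>.
first_only = [span.find("img")["title"] for span in spans]

# find_all() returns every matching descendant, so flatten across the spans.
all_titles = [img["title"] for span in spans for img in span.find_all("img")]

print(first_only)
print(all_titles)
```

Because the list comprehension in the question iterates over the spans rather than over the images, it can never produce more than one title per <span>.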

Adding a working example, based on the comments:

import bs4

content ='''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>'''

soup = bs4.BeautifulSoup(content, "html.parser")

Here is the difference:

idiomaSpan = soup.select_one('span')

idioma = [elem['title'] for elem in idiomaSpan.find_all('img')]
print(idioma)
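
The question also asks for the src of each image. A minimal sketch of collecting both attributes in one pass with BeautifulSoup follows; it uses a shortened two-image version of the HTML above, and the selector "span img.post_flagen" and the name flags are assumptions chosen for this illustration, not something taken from the original answers:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two of the ten images).
html = """<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Select the <img> tags directly and read both attributes from each one.
# .get() returns None instead of raising KeyError if an attribute is missing.
flags = [(img.get("title"), img.get("src")) for img in soup.select("span img.post_flagen")]

for title, src in flags:
    print(title, src)
```

Selecting the <img> tags themselves, rather than the enclosing <span>, avoids the extra find_all() step.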
