如何使用Beautifulsoup或Selenium提取图像的标题和src？

[<span> <img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/> <img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/> <img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/> <img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/> <img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/> <img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/> <img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/> <img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/> <img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/> <img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/> </span>]

2条回答

网友

1楼 · 编辑于 2024-10-01 22:32:50

要使用Selenium和python从所有<span>中提取title和src属性，必须为visibility_of_all_elements_located()诱导WebDriverWait，并且可以使用以下任一Locator Strategies：

对标题使用CSS_SELECTOR：

print([my_elem.get_attribute("title") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".idioma > span:nth-child(1) img.post_flagen[alt^='Idioma']")))])

对src使用XPATH

print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@class, 'idioma')]//span//img[starts-with(@alt, 'Idioma') and @class='post_flagen']")))])

注意：您必须添加以下导入：

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

网友

2楼 · 编辑于 2024-10-01 22:32:50

为什么你得不到所有冠军？

这是因为在惯用语中只有一个元素，而您使用的find()只能获得第一个匹配项

您可以这样做：

idioma = [''.join(elem['title']) for elem in idioma.findAll('img')]
print (idioma)

输出

['Idioma Aleman', 'Idioma Chino-tradicional', 'Idioma Coreano', 'Idioma Español', 'Idioma Español-latino', 'Idioma Frances', 'Idioma Ingles', 'Idioma Italiano', 'Idioma Portugues', 'Idioma Ruso']

根据评论添加工作示例

import bs4

content ='''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>'''

soup = bs4.BeautifulSoup(content)

以下是不同之处：

idiomaSpan = soup.select_one('span')

idioma = [''.join(elem['title']) for elem in idiomaSpan.find_all('img')]
print (idioma)

相关问题更多 >

编程相关推荐

热门问题

热门文章