.get('href') returns None instead of h

from splinter import Browser
import bs4 as bs
import os
import time
import csv

url = 'https://www.unpri.org/directory/'

path = os.getcwd() + "/chromedriver"
executable_path = {'executable_path': path}
browser = Browser('chrome', **executable_path)

browser.visit(url)

source = browser.html

soup = bs.BeautifulSoup(source, 'lxml')

for url in soup.find_all('div', class_="col-xs-8 col-md-9"):
    print(url.get('href', None))

1 answer

User

#1 · Posted 2024-10-03 04:33:23

The idea is to click "show more" until all links are shown, and then just gather the links.

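The script uses Selenium to click each of the three buttons until all of its links are displayed, then saves the full page HTML to a file named page_source.html.

The HTML is then parsed with BeautifulSoup, the results are stored in a dict ({org_name: url}), and the dict is dumped to a JSON file named organisations.json.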

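import json
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import (
    ElementNotInteractableException,
    ElementNotVisibleException,
)
from selenium.webdriver.common.by import By


def click_button_until_all_displayed(browser, button_id):
    # find_element_by_id was removed in Selenium 4; use find_element(By.ID, ...)
    button = browser.find_element(By.ID, button_id)
    while True:
        try:
            button.click()
        except (ElementNotInteractableException, ElementNotVisibleException):
            # The button can no longer be clicked, so all entries are shown
            break
        sleep(1.2)  # give the page time to load the next batch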


BASE_URL = 'https://www.unpri.org'
driver = webdriver.Chrome()
driver.get('{}/directory'.format(BASE_URL))

for button_name in ('asset', 'invest', 'services'):
    click_button_until_all_displayed(driver, 'see_all_{}'.format(button_name))

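# Write and read with an explicit encoding so non-ASCII names survive
with open('page_source.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

driver.close()

with open('page_source.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')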

orgs = {}
for div in soup.find_all('div', class_="col-xs-8 col-md-9"):
    org_name = div.h5.a.text.strip()
    orgs[org_name] = '{}{}'.format(BASE_URL, div.h5.a['href'])

with open('organisations.json', 'w') as f:
    json.dump(orgs, f, indent=2)
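
As for why the question's code prints None: `find_all` returns the `div` elements, and a `div` with that class has no `href` attribute of its own. The link sits on the `a` nested inside the `h5`, which is why the code above reads `div.h5.a['href']`. Here is a minimal sketch of the difference. The HTML snippet, organisation name, and path are made up for illustration and only mimic the div > h5 > a structure the code above relies on; `html.parser` is used so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same nesting as the directory entries (div > h5 > a),
# but the organisation name and path are invented for this example.
SAMPLE_HTML = """
<div class="col-xs-8 col-md-9">
  <h5><a href="/signatories/example-org">Example Org</a></h5>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", class_="col-xs-8 col-md-9")

# The div carries a class attribute but no href, so .get() returns None.
div_href = div.get("href")

# The href belongs to the <a> tag nested inside the <h5>.
link_href = div.h5.a.get("href")
link_text = div.h5.a.text.strip()

print(div_href)
print(link_href, link_text)
```

Using `.get("href")` on the anchor rather than `['href']` returns None instead of raising `KeyError` if an anchor happens to lack the attribute.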

It took just under 4 minutes for all the links to be displayed. If you want to save some time, here is a link to the gist showing this source code, page_source.html and {}.
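
One optional variation: the script builds each absolute URL by concatenating BASE_URL with the `href`, which assumes every `href` is a site-relative path. If some entries might already be absolute, `urllib.parse.urljoin` from the standard library handles both cases. The helper name `absolute_url` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.unpri.org"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; already-absolute hrefs pass through unchanged."""
    return urljoin(base, href)


# Hypothetical hrefs, one relative and one already absolute.
relative = absolute_url("/signatories/example-org")
already_absolute = absolute_url("https://example.com/profile")

print(relative)
print(already_absolute)
```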
