.get('href') returns None instead of h

Posted 2024-10-03 04:33:23

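Hi, I'm trying to scrape the company links from the following website: https://www.unpri.org/directory/. However, my code keeps returning None instead of the href. I tried searching here, but couldn't find anyone else with the same problem.

Here is my original code: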

from splinter import Browser
import bs4 as bs
import os
import time
import csv

url = 'https://www.unpri.org/directory/'

path = os.getcwd() + "/chromedriver"
executable_path = {'executable_path': path}
browser = Browser('chrome', **executable_path)

browser.visit(url)

source = browser.html

soup = bs.BeautifulSoup(source,'lxml')



for url in soup.find_all('div',class_="col-xs-8 col-md-9"):
    print(url.get('href', None))

1 answer
User
#1 · Posted 2024-10-03 04:33:23

The idea is to click "show more" until all links are shown, and then just gather the links.

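The script below uses Selenium to click each of the three buttons until all of their links are shown. It then saves the HTML of the whole page to a file named page_source.html.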

The HTML is then parsed with BeautifulSoup, collected into a dict ({org_name: url}), and dumped to a JSON file named organisations.json.

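import json
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import (
    ElementNotInteractableException,
    ElementNotVisibleException,
)
from selenium.webdriver.common.by import By


def click_button_until_all_displayed(browser, button_id):
    # Keep clicking the "show more" button until it can no longer be
    # clicked, which is taken to mean that every entry has been loaded.
    # find_element(By.ID, ...) replaces find_element_by_id, which was
    # removed in Selenium 4.
    button = browser.find_element(By.ID, button_id)
    while True:
        try:
            button.click()
        except (ElementNotVisibleException, ElementNotInteractableException):
            # Older drivers raise the first exception; newer ones
            # typically raise the second for a hidden element.
            break
        sleep(1.2)  # give the page time to load the next batch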


BASE_URL = 'https://www.unpri.org'
driver = webdriver.Chrome()
driver.get('{}/directory'.format(BASE_URL))

for button_name in ('asset', 'invest', 'services'):
    click_button_until_all_displayed(driver, 'see_all_{}'.format(button_name))

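# Save the fully expanded page so it can be parsed without the browser.
with open('page_source.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

# quit() ends the whole WebDriver session, not just the current window.
driver.quit()

with open('page_source.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')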

orgs = {}
for div in soup.find_all('div', class_="col-xs-8 col-md-9"):
    org_name = div.h5.a.text.strip()
    orgs[org_name] = '{}{}'.format(BASE_URL, div.h5.a['href'])

with open('organisations.json', 'w') as f:
    json.dump(orgs, f, indent=2)

It took just under 4 minutes for all the links to be displayed. If you want to save some time, here is a link to the gist showing this source code, page_source.html and {}.
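
As for why the loop in the question prints None: it calls .get('href') on the div elements themselves, and a div carries no href attribute. The href sits on the link nested inside each div, which is what div.h5.a['href'] reaches in the script above. Here is a minimal sketch of that difference. The sample markup, the /hypothetical/example-org path and the extract_links helper are all invented for illustration and only assume the div > h5 > a nesting that the parsing code implies; the real page may differ. It uses the built-in html.parser so it runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.unpri.org'

# Hypothetical markup, assumed to resemble directory entries;
# the href value is made up and the real page may look different.
SAMPLE_HTML = """
<div class="col-xs-8 col-md-9">
  <h5><a href="/hypothetical/example-org"> Example Org </a></h5>
</div>
<div class="col-xs-8 col-md-9">
  <p>A container without any link</p>
</div>
"""


def extract_links(html, base_url=BASE_URL):
    """Map each organisation name to its absolute URL (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    for div in soup.find_all('div', class_='col-xs-8 col-md-9'):
        # The href lives on the nested <a>, not on the <div> itself.
        anchor = div.find('a', href=True)
        if anchor is None:
            continue  # skip containers that hold no link
        links[anchor.get_text(strip=True)] = urljoin(base_url, anchor['href'])
    return links


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
first_div = soup.find('div', class_='col-xs-8 col-md-9')
print(first_div.get('href'))  # None: the <div> has no href attribute
print(extract_links(SAMPLE_HTML))
```

Using div.find('a', href=True) instead of div.h5.a is a design choice for this sketch: it skips a div that has no link rather than raising an AttributeError, at the cost of taking whichever link comes first inside the div.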
