Selenium scraping with find_elements_by_xpath returns an empty table

Posted 2024-09-27 04:10:17


I'm trying to scrape the table from this page: http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans

I'm using Selenium because I need to click from page 1 through pages 2, 3, and 4 and scrape the table on each page with code like this: driver.execute_script("__doPostBack('ctl00$MainContent$gvFacilityList','Page$2')")

However, I can't even scrape the first table. The code below produces no output at all; it doesn't even print "hi!":

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans')
# the loop body never executes because the XPath matches no elements
for tr in driver.find_elements_by_xpath('//*[@id="MainContent_gvFacilityList"]/table/tr'):
    print("hi!")
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])

I've read other threads on Stack Overflow about this problem, but none of them explain why I'm getting no results.


2 Answers

If you want to scrape the Facility Name column:

Sample code:

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
driver.get('http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans')
# each facility name in the grid is a link whose id starts with 'MainContent_gvFacilityList_lbFacility'
for name in driver.find_elements(By.CSS_SELECTOR, "a[id^='MainContent_gvFacilityList_lbFacility']"):
    print(name.text)

Or, if you want to scrape all of the data:

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans")
wait = WebDriverWait(driver, 20)
total_rows = len(driver.find_elements(By.XPATH, "//a[contains(@id,'MainContent_gvFacilityList_lbFacility')]"))
all_rows = driver.find_elements(By.XPATH, "//a[contains(@id,'MainContent_gvFacilityList_lbFacility')]")
name = []
license_type = []
age_range = []
city = []
for row in all_rows:
    name.append(row.text)
    # walk up from the facility-name link to its table cell, then read the sibling cells in the same row
    license_type.append(row.find_element(By.XPATH, "../../following-sibling::td[1]").text)
    age_range.append(row.find_element(By.XPATH, "../../following-sibling::td[2]").text)
    city.append(row.find_element(By.XPATH, "../../following-sibling::td[3]").text)

print(name, license_type, age_range, city)

You need these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

"C:\Program Files\Python39\python.exe" C:/Users/***/PycharmProjects/SeleniumSO/Chrome.py
['3 Sisters Academy', 'Academy of the Sacred Heart Little Heart', "ACS Children's House", "ACS Children's House Gentilly", 'Angel Care Learning Center', 'Angels Haven Daycare and Preschool', 'Anthonika Gidney', 'Audubon Primary Academy', 'Audubon Primary Preschool', 'Auntie B Learning Academy', 'Because Wee Care Learning Academy', 'Benjamin Thomas Academy', 'Bridgett White', 'Bright Horizons at Tulane University', 'Bright Minds Academy, LLC', "Carbo's Learning Express", "Carbo's Learning Express-East", 'Carolyn Green Ford Head Start', 'Carrollton-Dunbar Head Start Center', 'Changing Stages', "Children's College of Academics", "Children's Palace Learning Academy", "Children's Palace Learning Academy", "Children's Place Love Center Learning Academy", "Children's Place LTD", "Clara's Little Lambs Preschool #5", "Clara's Little Lambs Preschool Academy", 'Claras Little Lambs at Federal City', 'Coloring House Christian Academy', 'Covered Kids Learning Academy', 'Cream of the Crop', 'Creative Kidz East', 'Crescent Cradle at Cabrini High School', 'Cub Corner Preschool', 'Cuddly Bear Child Development Center', "D J's Learning Castle", 'Danielle Ann Varnado', 'Diana Head Start Center', 'Dionne Harvey', 'Discovery Kids Preschool and Daycare Center', "DJ's Learning Center LLC", 'Dr. Peter W. Dangerfield Head Start Center', 'Dryades YMCA Daycare', 'Early Discovery Child Care Center', 'Early Learning Center of NOBTS', 'Early Partners', 'Ecole Bilingue de la Nouvelle Orleans', 'Educare New Orleans', 'Ethel Woodard', 'First Academy Early Learning Center'] ['Early Learning Center III', 'Early Learning Center I', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center II', 'Early Learning Center II', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center II', 'Family Child Care Provider', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center II', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center I', 'Early Learning Center I', 'Early Learning Center III', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center I', 'Early Learning Center III', 'In-Home Provider', 'Early Learning Center I'] ['6 W To 12 Y', '6 W To 5 Y', '3 Y To 6 Y', '3 Y To 6 Y', '6 W To 12 Y', '6 W To 13 Y', '0 Y To 12 Y', '6 W To 12 Y', '5 W To 12 Y', '3 W To 12 Y', '6 W To 16 Y', '6 W To 5 Y', '0 Y To 12 Y', '6 W To 5 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '1 W To 5 Y', '35 M To 5 Y', '6 W To 16 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 4 Y', '1 W To 12 Y', '6 W To 12 Y', '6 W To 14 Y', '6 W To 17 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 14 Y', '6 W To 4 Y', '6 W To 3 Y', '3 M To 12 Y', '6 W To 12 Y', '00 Y To 12 Y', '35 M To 
5 Y', '00 Y To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '34 M To 5 Y', '6 M To 12 Y', '8 W To 4 Y', '6 W To 12 Y', '3 Y To 4 Y', '18 M To 5 Y', '6 W To 5 Y', '0 Y To 12 Y', '6 W To 4 Y'] ['New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans']

Process finished with exit code 0
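
As a follow-up (not part of the original answer), the four lists above can be combined into a single table; a minimal sketch, assuming pandas is installed and using illustrative column labels:

import pandas as pd

# assumes the name, license_type, age_range and city lists built by the snippet above
df = pd.DataFrame({
    "Facility Name": name,
    "License Type": license_type,
    "Age Range": age_range,
    "City": city,
})
print(df.head())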

I found a solution that doesn't require XPath at all. Instead, I save each page of results as an HTML file:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import os

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 20)

# base url
url = "http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans"

# make sure the output folder exists
os.makedirs('htmls', exist_ok=True)

# scrape first page
driver.get(url)
print("scraping page 1")
with open('htmls/file1.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

# scrape the other pages by firing the GridView's postback for each page number
script = [f"__doPostBack('ctl00$MainContent$gvFacilityList','Page${num}')" for num in range(2, 5)]
script_counter = 1
for item in script:
    driver.get(url)
    driver.execute_script(item)
    script_counter += 1
    print(f"scraping page {script_counter}")
    with open(f'htmls/file{script_counter}.html', 'w', encoding='utf-8') as f:
        f.write(driver.page_source)

Then I parse each HTML file with BeautifulSoup. Getting the table out is simple because you only need soup.find("table"), and that table can then be read into a DataFrame:

import pandas as pd
from bs4 import BeautifulSoup
import glob

files = glob.glob('htmls/*')

dfs = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    sp_table = soup.find("table")           # the facility list is the first table on the page
    df = pd.read_html(str(sp_table))[0]     # let pandas parse the HTML table
    dfs.append(df)

# stack the pages into one DataFrame (DataFrame.append is deprecated in recent pandas)
df_full = pd.concat(dfs, ignore_index=True)
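
If the combined table needs to be persisted, a small additional sketch (the CSV filename is just an example):

# write the merged table to disk; the filename is illustrative
df_full.to_csv('orleans_facilities.csv', index=False)
print(df_full.shape)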
    
