Selenium scraping with find_elements_by_xpath returns an empty table

Posted 2024-09-27 04:10:17


I'm trying to scrape the table from this page: http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans

I'm using Selenium because I need to click from page 1 through pages 2, 3, and 4 and scrape the table on each page with code like this: driver.execute_script("__doPostBack('ctl00$MainContent$gvFacilityList','Page$2')")

However, I can't even scrape the first table. The code below produces no output at all; it doesn't even print "hi!":

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans')
# the loop body never executes because the XPath matches no elements
for tr in driver.find_elements_by_xpath('//*[@id="MainContent_gvFacilityList"]/table/tr'):
    print("hi!")
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])

I've read other threads on Stack Overflow about this problem, but none of them explain why I'm getting no results.


2 Answers

If you want to scrape the Facility Name column:

Sample code:

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
driver.get('http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans')
# each facility name in the grid is a link whose id starts with 'MainContent_gvFacilityList_lbFacility'
for name in driver.find_elements(By.CSS_SELECTOR, "a[id^='MainContent_gvFacilityList_lbFacility']"):
    print(name.text)

Or, if you want to scrape all of the data:

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans")
wait = WebDriverWait(driver, 20)
total_rows = len(driver.find_elements(By.XPATH, "//a[contains(@id,'MainContent_gvFacilityList_lbFacility')]"))
all_rows = driver.find_elements(By.XPATH, "//a[contains(@id,'MainContent_gvFacilityList_lbFacility')]")
name = []
license_type = []
age_range = []
city = []
for row in all_rows:
    name.append(row.text)
    # walk up from the facility-name link to its table cell, then read the sibling cells in the same row
    license_type.append(row.find_element(By.XPATH, "../../following-sibling::td[1]").text)
    age_range.append(row.find_element(By.XPATH, "../../following-sibling::td[2]").text)
    city.append(row.find_element(By.XPATH, "../../following-sibling::td[3]").text)

print(name, license_type, age_range, city)

You need these imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

"C:\Program Files\Python39\python.exe" C:/Users/***/PycharmProjects/SeleniumSO/Chrome.py
['3 Sisters Academy', 'Academy of the Sacred Heart Little Heart', "ACS Children's House", "ACS Children's House Gentilly", 'Angel Care Learning Center', 'Angels Haven Daycare and Preschool', 'Anthonika Gidney', 'Audubon Primary Academy', 'Audubon Primary Preschool', 'Auntie B Learning Academy', 'Because Wee Care Learning Academy', 'Benjamin Thomas Academy', 'Bridgett White', 'Bright Horizons at Tulane University', 'Bright Minds Academy, LLC', "Carbo's Learning Express", "Carbo's Learning Express-East", 'Carolyn Green Ford Head Start', 'Carrollton-Dunbar Head Start Center', 'Changing Stages', "Children's College of Academics", "Children's Palace Learning Academy", "Children's Palace Learning Academy", "Children's Place Love Center Learning Academy", "Children's Place LTD", "Clara's Little Lambs Preschool #5", "Clara's Little Lambs Preschool Academy", 'Claras Little Lambs at Federal City', 'Coloring House Christian Academy', 'Covered Kids Learning Academy', 'Cream of the Crop', 'Creative Kidz East', 'Crescent Cradle at Cabrini High School', 'Cub Corner Preschool', 'Cuddly Bear Child Development Center', "D J's Learning Castle", 'Danielle Ann Varnado', 'Diana Head Start Center', 'Dionne Harvey', 'Discovery Kids Preschool and Daycare Center', "DJ's Learning Center LLC", 'Dr. Peter W. Dangerfield Head Start Center', 'Dryades YMCA Daycare', 'Early Discovery Child Care Center', 'Early Learning Center of NOBTS', 'Early Partners', 'Ecole Bilingue de la Nouvelle Orleans', 'Educare New Orleans', 'Ethel Woodard', 'First Academy Early Learning Center'] ['Early Learning Center III', 'Early Learning Center I', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center II', 'Early Learning Center II', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center II', 'Family Child Care Provider', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center II', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center I', 'Early Learning Center I', 'Early Learning Center III', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center III', 'Family Child Care Provider', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center III', 'Early Learning Center I', 'Early Learning Center III', 'In-Home Provider', 'Early Learning Center I'] ['6 W To 12 Y', '6 W To 5 Y', '3 Y To 6 Y', '3 Y To 6 Y', '6 W To 12 Y', '6 W To 13 Y', '0 Y To 12 Y', '6 W To 12 Y', '5 W To 12 Y', '3 W To 12 Y', '6 W To 16 Y', '6 W To 5 Y', '0 Y To 12 Y', '6 W To 5 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '1 W To 5 Y', '35 M To 5 Y', '6 W To 16 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 4 Y', '1 W To 12 Y', '6 W To 12 Y', '6 W To 14 Y', '6 W To 17 Y', '6 W To 12 Y', '6 W To 12 Y', '6 W To 14 Y', '6 W To 4 Y', '6 W To 3 Y', '3 M To 12 Y', '6 W To 12 Y', '00 Y To 12 Y', '35 M To 
5 Y', '00 Y To 12 Y', '6 W To 12 Y', '6 W To 12 Y', '34 M To 5 Y', '6 M To 12 Y', '8 W To 4 Y', '6 W To 12 Y', '3 Y To 4 Y', '18 M To 5 Y', '6 W To 5 Y', '0 Y To 12 Y', '6 W To 4 Y'] ['New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans', 'New Orleans']

Process finished with exit code 0
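
As a follow-up (not part of the original answer), the four lists above can be combined into a single table; a minimal sketch, assuming pandas is installed and using illustrative column labels:

import pandas as pd

# assumes the name, license_type, age_range and city lists built by the snippet above
df = pd.DataFrame({
    "Facility Name": name,
    "License Type": license_type,
    "Age Range": age_range,
    "City": city,
})
print(df.head())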

I found a solution that doesn't require XPath at all. Instead, I save each page of results as an HTML file:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import os

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 20)

# base url
url = "http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans"

# make sure the output folder exists
os.makedirs('htmls', exist_ok=True)

# scrape first page
driver.get(url)
print("scraping page 1")
with open('htmls/file1.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

# scrape the other pages by firing the GridView's postback for each page number
script = [f"__doPostBack('ctl00$MainContent$gvFacilityList','Page${num}')" for num in range(2, 5)]
script_counter = 1
for item in script:
    driver.get(url)
    driver.execute_script(item)
    script_counter += 1
    print(f"scraping page {script_counter}")
    with open(f'htmls/file{script_counter}.html', 'w', encoding='utf-8') as f:
        f.write(driver.page_source)

Then I parse each HTML file with BeautifulSoup. Getting the table out is simple because you only need soup.find("table"), and that table can then be read into a DataFrame:

import pandas as pd
from bs4 import BeautifulSoup
import glob

files = glob.glob('htmls/*')

dfs = []
for file in files:
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
    soup = BeautifulSoup(content, 'html.parser')
    sp_table = soup.find("table")           # the facility list is the first table on the page
    df = pd.read_html(str(sp_table))[0]     # let pandas parse the HTML table
    dfs.append(df)

# stack the pages into one DataFrame (DataFrame.append is deprecated in recent pandas)
df_full = pd.concat(dfs, ignore_index=True)
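
If the combined table needs to be persisted, a small additional sketch (the CSV filename is just an example):

# write the merged table to disk; the filename is illustrative
df_full.to_csv('orleans_facilities.csv', index=False)
print(df_full.shape)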
    
