<p>I found a solution that doesn't require XPath at all. Instead, I save each page of results as an HTML file:</p>
<pre><code>from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 20)
# base url
url = "http://carefacility.doe.louisiana.gov/covid19/List.aspx?parish=Orleans"
#scrape first page
driver.get(url)
print("scraping page 1")
with open(f'htmls/file1.html', 'w') as f:
f.write(driver.page_source)
#scrape the other pages
script = [f"__doPostBack('ctl00$MainContent$gvFacilityList','Page${num}')" for num in range(2,5)]
script_counter = 1
for item in script:
driver.get(url)
driver.execute_script(item)
script_counter +=1
print(f"scraping page {script_counter}")
with open(f'htmls/file{script_counter}.html', 'w') as f:
f.write(driver.page_source)
</code></pre>
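<p>A note on the <code>__doPostBack</code> strings: inside the f-string, the <code>$</code> is a literal character and only <code>{num}</code> is interpolated, so the calls target pages 2 through 4. A quick standalone check of what the comprehension produces:</p>

```python
# Reproduce the list comprehension from the scraper above to show the
# exact JavaScript strings that get passed to driver.execute_script().
script = [f"__doPostBack('ctl00$MainContent$gvFacilityList','Page${num}')"
          for num in range(2, 5)]
for item in script:
    print(item)
# __doPostBack('ctl00$MainContent$gvFacilityList','Page$2')
# __doPostBack('ctl00$MainContent$gvFacilityList','Page$3')
# __doPostBack('ctl00$MainContent$gvFacilityList','Page$4')
```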
<p>Then I scrape each HTML file with BeautifulSoup. Pulling the table out with BeautifulSoup is straightforward, because you only need <code>soup.find("table")</code> and can then load that table into a DataFrame:</p>
<pre><code>import pandas as pd
from bs4 import BeautifulSoup
import glob
files = glob.glob('htmls/*')
df_full = pd.DataFrame()
for file in files:
with open(file, 'r') as f:
content = f.read()
soup = BeautifulSoup(content, 'html.parser')
sp_table = soup.find("table")
df = pd.read_html(str(sp_table))[0]
df_full = df_full.append(df)
</code></pre>
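<p>To see the <code>soup.find("table")</code> and <code>pd.read_html</code> step in isolation, here is a minimal, self-contained sketch that parses an inline HTML snippet instead of the saved files (the table contents are made up for the demo):</p>

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# A made-up two-row table standing in for one saved results page.
html = """
<html><body>
<table>
  <tr><th>Facility</th><th>Parish</th></tr>
  <tr><td>Example Center</td><td>Orleans</td></tr>
  <tr><td>Sample Academy</td><td>Orleans</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
sp_table = soup.find("table")                  # first <table> in the markup
df = pd.read_html(StringIO(str(sp_table)))[0]  # header row becomes the columns
print(df.shape)   # (2, 2)
```

Wrapping the markup in <code>StringIO</code> avoids the pandas deprecation warning about passing literal HTML strings to <code>read_html</code>.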