Scraping a table with Python when the table is not defined with HTML table tags (no tr or td)

Posted 2024-10-01 02:19:13


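I am trying to scrape the table from Hydro One: https://stormcentre.hydroone.com/reports/1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659 The same code works fine on other websites. I am not sure whether pandas can pick up a table from an HTML class or role. The class appears to be: ReactVirtualized__Table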

from bs4 import BeautifulSoup

import io
import requests
import pandas as pd
import datetime as dt

from zipfile import ZipFile

df = pd.read_html('https://stormcentre.hydroone.com/reports/1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659')

for i, table in enumerate(df):
    table.insert(0, "time", dt.datetime.now(), True)
    table.to_csv('HydroExport.csv', sep=',', index=False, date_format='%Y-%m-%d %H:%M:%S')


print(table.to_string(index=False))

document = table.to_dict(orient='list')
print(document)

This fails with an error saying that no tables were found:

Traceback (most recent call last):
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-148a9ae7cc2f>", line 1, in <module>
    runfile('C:/Users/medsa/OneDrive/Documents/Py/datapython/Ch04/04_02/ImportHydroOne.py', wdir='C:/Users/medsa/OneDrive/Documents/Py/datapython/Ch04/04_02')
  File "C:\Program Files\JetBrains\PyCharm 2020.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2020.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/medsa/OneDrive/Documents/Py/datapython/Ch04/04_02/ImportHydroOne.py", line 10, in <module>
    df = pd.read_html('https://stormcentre.hydroone.com/reports/1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659')
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\io\html.py", line 1086, in read_html
    return _parse(
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\io\html.py", line 917, in _parse
    raise retained
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\io\html.py", line 898, in _parse
    tables = p.parse_tables()
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\io\html.py", line 217, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Users\medsa\OneDrive\Documents\Py\venv\lib\site-packages\pandas\io\html.py", line 547, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

Here is part of the page's HTML:

<div class="autosizer-wrapper" style="position: relative;"><div style="overflow: visible; height: 0px; width: 0px;"><div class="ReactVirtualized__Table report-table" role="grid"><div class="ReactVirtualized__Table__headerRow report-row odd" role="row" style="align-items: stretch; height: 50px; overflow: hidden; padding-right: 0px; width: 1009px;"><div aria-label="info-box-field-label-1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659-name" aria-sort="ascending" class="ReactVirtualized__Table__headerColumn name ReactVirtualized__Table__sortableHeaderColumn" role="columnheader" tabindex="0" style="flex: 1 1 400px;"><div class="kubra-table-header"><div class="header-label"><div class="name">Service Area:</div></div><div class="toggle-icon"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="caret-up" class="svg-inline--fa fa-caret-up fa-w-10 fa-sm className fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 320 512" data-glyph="caret-up" aria-label="caret-up"><path fill="currentColor" d="M288.662 352H31.338c-17.818 0-26.741-21.543-14.142-34.142l128.662-128.662c7.81-7.81 20.474-7.81 28.284 0l128.662 128.662c12.6 12.599 3.676 34.142-14.142 34.142z"></path></svg></div></div></div><div aria-label="info-box-field-label-1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659-cust_a" class="ReactVirtualized__Table__headerColumn cust-a ReactVirtualized__Table__sortableHeaderColumn" role="columnheader" tabindex="0" style="flex: 1 1 400px;"><div class="kubra-table-header"><div class="header-label"><div class="column-icon"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="exclamation-triangle" class="svg-inline--fa fa-exclamation-triangle fa-w-18 fa-sm className fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 576 512" data-glyph="exclamation-triangle" aria-label="exclamation-triangle"><path fill="currentColor" d="M569.517 440.013C587.975 472.007 564.806 512 527.94 512H48.054c-36.937 0-59.999-40.055-41.577-71.987L246.423 
23.985c18.467-32.009 64.72-31.951 83.154 0l239.94 416.028zM288 354c-25.405 0-46 20.595-46 46s20.595 46 46 46 46-20.595 46-46-20.595-46-46-46zm-43.673-165.346l7.418 136c.347 6.364 5.609 11.346 11.982 11.346h48.546c6.373 0 11.635-4.982 11.982-11.346l7.418-136c.375-6.874-5.098-12.654-11.982-12.654h-63.383c-6.884 0-12.356 5.78-11.981 12.654z"></path></svg></div><div class="name">Customers Affected:</div></div></div></div><div aria-label="info-box-field-label-1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659-cust_s" class="ReactVirtualized__Table__headerColumn cust-s hidden-lt-tablet ReactVirtualized__Table__sortableHeaderColumn" role="columnheader" tabindex="0" style="flex: 1 1 400px;"><div class="kubra-table-header"><div class="header-label"><div class="column-icon"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="user" class="svg-inline--fa fa-user fa-w-14 fa-sm className fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512" data-glyph="user" aria-label="user"><path fill="currentColor" d="M224 256c70.7 0 128-57.3 128-128S294.7 0 224 0 96 57.3 96 128s57.3 128 128 128zm89.6 32h-16.7c-22.2 10.2-46.9 16-72.9 16s-50.6-5.8-72.9-16h-16.7C60.2 288 0 348.2 0 422.4V464c0 26.5 21.5 48 48 48h352c26.5 0 48-21.5 48-48v-41.6c0-74.2-60.2-134.4-134.4-134.4z"></path></svg></div><div class="name">Customers Served:</div></div></div></div><div aria-label="info-box-field-label-1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659-etr" class="ReactVirtualized__Table__headerColumn etr ReactVirtualized__Table__sortableHeaderColumn" role="columnheader" tabindex="0" style="flex: 1 1 400px;"><div class="kubra-table-header"><div class="header-label"><div class="name">Estimated Restoration:</div></div></div></div></div><div aria-label="grid" aria-readonly="true" class="ReactVirtualized__Grid ReactVirtualized__Table__Grid" role="rowgroup" tabindex="0" style="box-sizing: border-box; direction: ltr; position: relative; width: 1009px; will-change: transform; overflow: hidden; height: 
614px;"><div class="ReactVirtualized__Grid__innerScrollContainer" role="rowgroup" style="width: auto; height: 300px; max-width: 1009px; max-height: 300px; overflow: hidden; position: relative;"><div class="ReactVirtualized__Table__row report-row even" role="row" style="height: 50px; left: 0px; position: absolute; top: 0px; width: 1009px; align-items: stretch; overflow: hidden; padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary " role="gridcell" tabindex="0">BANCROFT</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">222</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">26,765</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Aug 26, 2020, 2:15 AM</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div><div 
class="ReactVirtualized__Table__row report-row odd" role="row" style="height: 50px; left: 0px; position: absolute; top: 50px; width: 1009px; align-items: stretch; overflow: hidden; padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary " role="gridcell" tabindex="0">BOWMANVILLE</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">Fewer than 20</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">28,879</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Assessing Damage</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div><div class="ReactVirtualized__Table__row report-row even" role="row" style="height: 50px; left: 0px; position: absolute; top: 100px; width: 1009px; align-items: stretch; overflow: hidden; 
padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary " role="gridcell" tabindex="0">DRYDEN</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">Fewer than 20</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">12,132</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Assessing Damage</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div><div class="ReactVirtualized__Table__row report-row odd" role="row" style="height: 50px; left: 0px; position: absolute; top: 150px; width: 1009px; align-items: stretch; overflow: hidden; padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary 
" role="gridcell" tabindex="0">MINDEN</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">699</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">20,096</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Aug 26, 2020, 2:15 AM</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div><div class="ReactVirtualized__Table__row report-row even" role="row" style="height: 50px; left: 0px; position: absolute; top: 200px; width: 1009px; align-items: stretch; overflow: hidden; padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary " role="gridcell" tabindex="0">NEWMARKET</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" 
" role="gridcell" tabindex="0">Fewer than 20</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">60,070</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Assessing Damage</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div><div class="ReactVirtualized__Table__row report-row odd" role="row" style="height: 50px; left: 0px; position: absolute; top: 250px; width: 1009px; align-items: stretch; overflow: hidden; padding-right: 0px;"><div class="ReactVirtualized__Table__rowColumn name" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class="level-1"><span class="clickable hyperlink-secondary " role="gridcell" tabindex="0">PICTON</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">Fewer than 20</span></div></div><div class="ReactVirtualized__Table__rowColumn cust-s hidden-lt-tablet" role="gridcell" style="flex: 1 1 400px; overflow: 
hidden;"><div class=""><span class=" " role="gridcell" tabindex="0">25,636</span></div></div><div class="ReactVirtualized__Table__rowColumn etr" role="gridcell" style="flex: 1 1 400px; overflow: hidden;"><div class=""><span class=" hidden-lt-tablet" role="gridcell" tabindex="0">Reassessing</span><button type="button" class="report-modal-button hidden-gte-tablet" aria-label="Open Data Modal" title="Open Data Modal"><svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="info" class="svg-inline--fa fa-info fa-w-6 fa-sm fa-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" data-glyph="info" aria-label="info"><path fill="currentColor" d="M20 424.229h20V279.771H20c-11.046 0-20-8.954-20-20V212c0-11.046 8.954-20 20-20h112c11.046 0 20 8.954 20 20v212.229h20c11.046 0 20 8.954 20 20V492c0 11.046-8.954 20-20 20H20c-11.046 0-20-8.954-20-20v-47.771c0-11.046 8.954-20 20-20zM96 0C56.235 0 24 32.235 24 72s32.235 72 72 72 72-32.235 72-72S135.764 0 96 0z"></path></svg></button></div></div></div></div></div></div></div><div class="resize-triggers"><div class="expand-trigger"><div style="width: 1010px; height: 665px;"></div></div><div class="contract-trigger"></div></div></div>

1 answer
Answer #1 · posted 2024-10-01 02:19:13

The main problem you are running into is that the page is generated dynamically. This becomes apparent if you try the following:

import requests
from bs4 import BeautifulSoup

target_url = 'https://stormcentre.hydroone.com/reports/1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659'

resp = requests.get(target_url)
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup)

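You will notice that the HTML in the response has nothing in its body except a script tag. The table is built in the browser by JavaScript, which is why pd.read_html finds no tables.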
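As a quick way to confirm this kind of diagnosis, the sketch below counts the tags in a response body using only the standard library. The `STATIC_SHELL` string is a hypothetical stand-in for the response described above, not the real page, and `TagCounter` and `count_tags` are illustrative names.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stand-in for a JavaScript-only page shell (not the real response)
STATIC_SHELL = """<!DOCTYPE html>
<html>
  <head><title>Report</title></head>
  <body><script src="/static/js/main.js"></script></body>
</html>"""


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    counter = TagCounter()
    counter.feed(html)
    return counter.counts


counts = count_tags(STATIC_SHELL)
# No <table> elements at all means pd.read_html has nothing to parse
print("tables:", counts["table"], "scripts:", counts["script"])
```

In practice you would pass `resp.text` from the request above to `count_tags`. A count of zero for `table` in the raw response suggests the data only appears after JavaScript runs.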

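So your first challenge is to parse a table that is generated by JavaScript, for which this SO thread comes in very handy. It describes how to use the Python package Selenium for this task.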

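However, once you have the HTML, it becomes clear that the generated HTML does not contain an HTML table either. Instead, the table is built out of <div> elements, so we need to specify which divs we want. I took some inspiration from this guide, which describes how to use an element's XPath to pick out the element you need. I then used CSS selectors, as described in this SO thread and the official Beautiful Soup docs.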
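To see the div-based structure in isolation, here is a minimal sketch that pulls headers and rows out of a trimmed, hand-written fragment shaped like the markup in the question. It swaps in the standard library's `html.parser` so it runs without a browser or extra packages; `DivGridParser` is a hypothetical helper, and it assumes every opened tag is also closed (no bare void tags such as `<br>`).

```python
from html.parser import HTMLParser

# Trimmed sample in the shape of the question's markup (two columns only)
SAMPLE_HTML = """
<div class="ReactVirtualized__Table report-table" role="grid">
  <div class="ReactVirtualized__Table__headerRow report-row odd" role="row">
    <div class="ReactVirtualized__Table__headerColumn name" role="columnheader">
      <div class="name">Service Area:</div>
    </div>
    <div class="ReactVirtualized__Table__headerColumn cust-a" role="columnheader">
      <div class="name">Customers Affected:</div>
    </div>
  </div>
  <div class="ReactVirtualized__Table__row report-row even" role="row">
    <div class="ReactVirtualized__Table__rowColumn name" role="gridcell">
      <div class="level-1"><span role="gridcell">BANCROFT</span></div>
    </div>
    <div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell">
      <div class=""><span role="gridcell">222</span></div>
    </div>
  </div>
  <div class="ReactVirtualized__Table__row report-row odd" role="row">
    <div class="ReactVirtualized__Table__rowColumn name" role="gridcell">
      <div class="level-1"><span role="gridcell">BOWMANVILLE</span></div>
    </div>
    <div class="ReactVirtualized__Table__rowColumn cust-a" role="gridcell">
      <div class=""><span role="gridcell">Fewer than 20</span></div>
    </div>
  </div>
</div>
"""


class DivGridParser(HTMLParser):
    """Collect header and cell text from a div-based grid (hypothetical helper)."""

    ROW = "ReactVirtualized__Table__row"
    HEADER_CELL = "ReactVirtualized__Table__headerColumn"
    DATA_CELL = "ReactVirtualized__Table__rowColumn"

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._depth = 0          # number of currently open tags
        self._cell_depth = None  # depth at which the current cell was opened
        self._target = None      # list that receives the finished cell text
        self._text = []

    def handle_starttag(self, tag, attrs):
        self._depth += 1
        if self._cell_depth is not None:
            return  # already inside a cell; nested tags only contribute text
        classes = (dict(attrs).get("class") or "").split()
        if self.ROW in classes:
            self.rows.append([])  # a new data row starts here
        elif self.HEADER_CELL in classes:
            self._cell_depth, self._target = self._depth, self.headers
        elif self.DATA_CELL in classes and self.rows:
            self._cell_depth, self._target = self._depth, self.rows[-1]

    def handle_data(self, data):
        if self._cell_depth is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._cell_depth == self._depth:
            # the tag that opened the cell is closing: store its text
            self._target.append("".join(self._text).strip())
            self._cell_depth, self._target, self._text = None, None, []
        self._depth -= 1


parser = DivGridParser()
parser.feed(SAMPLE_HTML)
print(parser.headers)
print(parser.rows)
```

Matching on the outer `rowColumn` div rather than on `role="gridcell"` matters here, because the inner `<span>` carries the same role and would be counted twice. The result can be passed to `pd.DataFrame(parser.rows, columns=parser.headers)`. Note that a virtualized grid typically renders only the rows currently in view, so a long report may expose only part of its data in the DOM at any one time.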

Here is a suggested solution:

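from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
from bs4 import BeautifulSoup

target_url = 'https://stormcentre.hydroone.com/reports/1e44c6bf-cc63-4d4a-a68e-4dbd8bb63659'
# specify XPATH (I found this 'manually' using Firefox's Inspect Element tool)
tablediv_xpath = '/html/body/div/div/div/div[1]/div[2]/div[3]'
# CSS selector/class pattern of table header/row elements
header_patt = '"sortableHeaderColumn"'
data_patt = '"rowColumn"'

driver = webdriver.Firefox()
driver.get(target_url)
# the table is rendered by JavaScript, so wait (up to 20 s) for the header cells
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, f'div[class*={header_patt}]'))
)

# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
tablediv_el = driver.find_element(By.XPATH, tablediv_xpath)
tablediv_html = tablediv_el.get_attribute('innerHTML')
driver.quit()

soup = BeautifulSoup(tablediv_html, 'html.parser')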
headers = [el.text for el in soup.select(f'div[class*={header_patt}]')]
data = [el.text for el in soup.select(f'div[class*={data_patt}]')]
n_cols = len(headers)
n_datapoints = len(data)
# put the data in an array where each row/list corresponds to 
# one row of data in the table
data_arr = [data[x:x+n_cols] for x in range(0, n_datapoints, n_cols)]
df = pd.DataFrame(data=data_arr, columns=headers)
print(df)
#  Service Area: Customers Affected: Customers Served: Estimated Restoration:
#0      BANCROFT                 222            26,765  Aug 26, 2020, 9:00 AM
#...
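Every value scraped this way is a string, including entries such as "Fewer than 20" and "Assessing Damage" that are not numbers or dates at all. A minimal sketch of one way to clean them is shown below; `parse_count` and `parse_etr` are hypothetical helpers, the sample rows are copied from the HTML in the question, and the date format assumes English month names and AM/PM markers (the default C locale).

```python
from datetime import datetime


def parse_count(text):
    """Return the count as an int, or None for text like 'Fewer than 20'."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def parse_etr(text):
    """Return the restoration time as a datetime, or None for status text."""
    try:
        return datetime.strptime(text.strip(), "%b %d, %Y, %I:%M %p")
    except ValueError:
        return None


# Two rows taken from the HTML shown in the question
raw_rows = [
    ["BANCROFT", "222", "26,765", "Aug 26, 2020, 2:15 AM"],
    ["BOWMANVILLE", "Fewer than 20", "28,879", "Assessing Damage"],
]

clean_rows = [
    [area, parse_count(affected), parse_count(served), parse_etr(etr)]
    for area, affected, served, etr in raw_rows
]
for row in clean_rows:
    print(row)
```

Returning None instead of guessing a number keeps "Fewer than 20" distinguishable from a real count. Whether to keep the original text in a separate column is a design choice that depends on how the CSV will be used.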
