How to get data from an HTML table in Python

2 answers

User

#1 · edited 2024-10-03 15:23:16

I know this is an old question, but an underrated secret for this task is pandas' read_clipboard function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_clipboard.html

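It does not parse HTML itself: it reads the text on the clipboard and passes it to read_csv, so the interface is very simple to use. Consider this short script: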

# 1. Go to a website, e.g. https://www.wunderground.com/hurricane/hurrarchive.asp?region=ep
# 2. Highlight the table of data, e.g. of Hurricanes in the East Pacific
# 3. Copy the text from your browser
# 4. Run this script: the data will be available as a dataframe
import pandas as pd
df = pd.read_clipboard()
print(df)

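Admittedly, this solution requires user interaction, but I have found it useful in many cases where there is no convenient CSV download or API endpoint.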
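To try the idea without touching the clipboard, the sketch below parses the same kind of text directly with read_csv. It assumes the browser copies the table as one row per line with tab-separated cells; the sample text and its column names are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical sample of what a browser might place on the clipboard
# when a table is copied: one row per line, cells separated by tabs.
copied_text = (
    "Name\tYear\tMax Wind (mph)\n"
    "Storm A\t2014\t85\n"
    "Storm B\t2015\t120\n"
)

# Parse the text with read_csv, the function read_clipboard hands
# the clipboard contents to.
df = pd.read_csv(io.StringIO(copied_text), sep="\t")
print(df)
```

If a real copied table comes out with misaligned columns, passing sep="\t" to pd.read_clipboard() is worth trying as well, since extra keyword arguments are forwarded to read_csv.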

User

#2 · edited 2024-10-03 15:23:16

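The HTML you posted does not contain all of the column fields listed in your data model. For the fields it does contain, however, this builds a Python dictionary from which you can fill in the data model's fields: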

import urllib.request
from bs4 import BeautifulSoup

url = "the_url_of_webpage_to_scrape" # Replace with the URL of your webpage

with urllib.request.urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')

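# The record lives in the first <tr class="even"> row
row = soup.find("tr", attrs={"class": "even"})
if row is None:
    raise ValueError("no <tr class='even'> row found on the page")

# Each field label is a <b> tag; its value is the text node right after it
bolds = row.find_all("b")
btags = [b.get_text().strip().strip(":") for b in bolds]
bsibs = [str(b.next_sibling or "").replace("\xa0", "").strip() for b in bolds]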

data = dict(zip(btags, bsibs))

data_model = {"record_date": None, "doc_number": None, "doc_type": None, "role": None, "name": None, "apn": None, "transfer_amount": None, "county": None, "state": None}

data_model["record_date"] = data['Recording Date']
data_model['role'] = data['Grantee']

print(data_model)

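Output:

{'record_date': '01/12/2016 08:05:17 AM', 'doc_number': None, 'doc_type': None, 'role': 'ARELLANO ISAIAS, ARELLANO ALICIA', 'name': None, 'apn': None, 'transfer_amount': None, 'county': None, 'state': None}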

With this you can do:

print(data_model['record_date']) # 01/12/2016 08:05:17 AM
print(data_model['role'])        # ARELLANO ISAIAS, ARELLANO ALICIA
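Since the page from the question is not reproduced here, the label/value extraction can be tried offline on an inline snippet. This is a minimal sketch: the markup below is invented for illustration (only the two field values come from the comments above), and it assumes each value is a text node that directly follows its <b> label.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, invented for illustration; the real page may differ.
html = """
<table>
  <tr class="even">
    <td>
      <b>Recording Date:</b>&nbsp;01/12/2016 08:05:17 AM<br>
      <b>Grantee:</b>&nbsp;ARELLANO ISAIAS, ARELLANO ALICIA<br>
      <b>Doc Number:</b><br>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", attrs={"class": "even"})

data = {}
if row is not None:
    for b in row.find_all("b"):
        label = b.get_text().strip().strip(":")
        sibling = b.next_sibling
        # Only a text node carries a value; a tag such as <br> (or no
        # sibling at all) means the field is empty.
        if isinstance(sibling, NavigableString):
            value = str(sibling).replace("\xa0", " ").strip()
        else:
            value = ""
        data[label] = value

print(data)
```

Checking the sibling's type keeps a label with no value from raising an error or picking up markup as its value.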

Hope this helps.
