Getting tables from an HTML file with Python

Posted 2024-09-28 17:24:50


from urllib.request import urlopen
from bs4 import BeautifulSoup

game_link = "http://espn.go.com/nba/playbyplay?gameId=400579510&period=0"
game_source = urlopen(game_link)
game_html = game_source.read()
game_source.close()
row = BeautifulSoup(game_html, "html.parser")
pieces = list(row.children)

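I need to get the game log rows from the link above, but the code above gives me the full HTML text. How can I extract the tables and convert them into individual rows (pieces)?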


1 answer
User
#1 · Posted 2024-09-28 17:24:50

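You can try BeautifulSoup.findAll (spelled find_all in current bs4 releases; findAll is the legacy alias), passing the tag name along with any other attributes you know about the tags you are looking for. After looking at the page, it seems you want all <tr> tags with the class "even", so you can use soup.findAll("tr", attrs = {"class": "even"}). For example: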

import urllib.request
from bs4 import BeautifulSoup

game_link = "http://espn.go.com/nba/playbyplay?gameId=400579510&period=0"
game_source = urllib.request.urlopen(game_link)
game_html = game_source.read()
game_source.close()
soup = BeautifulSoup(game_html, "html.parser")
# find all instances of a row with class "even"
# (findAll is the legacy alias of find_all)
rows = soup.findAll("tr", attrs = {"class": "even"})
for row in rows:
    # do work
    print(row)
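
If the table alternates row classes, filtering on "even" alone would skip every other row. The sketch below assumes the rows are tagged "even" and "odd", which the answer does not confirm, and runs on a small hypothetical HTML sample rather than the real page. It relies on the fact that bs4 accepts a list as an attribute filter and matches any item in it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the class names "odd" and "header" are
# assumptions for illustration and may not match the real page.
SAMPLE_HTML = """<table>
<tr class="header"><td>TIME</td></tr>
<tr class="even"><td>12:00</td></tr>
<tr class="odd"><td>11:40</td></tr>
</table>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list filter matches a row whose class is any of the listed values.
rows = soup.find_all("tr", class_=["even", "odd"])

# Take the text of the first cell in each matched row.
times = [row.td.get_text() for row in rows]
print(times)
```

Check the page source for the actual class names before relying on this filter.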

You still need to parse the HTML of each row. Here is a very rough example:

def parse_row(row):
    """Turn one <tr> into a dict, or return None if it has fewer than 4 cells."""
    cols = row.findAll("td")  # get each column in the row
    # ignore timeouts (rows with fewer cells), this is just an example
    if len(cols) < 4:
        return None
    return {
        "time": cols[0].get_text(),
        "team1": cols[1].get_text(),
        "score": cols[2].get_text(),
        "team2": cols[3].get_text(),
    }

parsed_rows = []
for row in rows:
    parsed = parse_row(row)
    if parsed:
        parsed_rows.append(parsed)
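
To try the same approach without hitting the network, here is a minimal sketch that runs on an inline HTML string. The sample markup, the `SAMPLE_HTML` constant, and the `row_to_dict` helper are hypothetical stand-ins; the real page's column layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the play-by-play table.
SAMPLE_HTML = """
<table>
  <tr class="even"><td>12:00</td><td>Jump ball</td><td>0-0</td><td></td></tr>
  <tr class="even"><td colspan="4">Timeout</td></tr>
  <tr class="even"><td>11:40</td><td></td><td>0-2</td><td>Made layup</td></tr>
</table>
"""

KEYS = ("time", "team1", "score", "team2")


def row_to_dict(row):
    """Map the first four cells of a <tr> to KEYS; None if there are fewer."""
    cols = row.find_all("td")
    if len(cols) < 4:
        return None
    # strip=True trims surrounding whitespace from each cell's text
    return dict(zip(KEYS, (col.get_text(strip=True) for col in cols)))


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all("tr", attrs={"class": "even"})

# Keep only the rows that parsed into a full record.
parsed_rows = [parsed for parsed in map(row_to_dict, rows) if parsed]

for parsed in parsed_rows:
    print(parsed)
```

The single-cell "Timeout" row is dropped because it has fewer than four cells, mirroring the length check in the rough example.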
