Web scraping: extracting attribute values, not text values, from the td cells of an entire table

Posted 2024-10-04 05:23:01


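I am trying to extract some data from a table, but what I actually want is the value of an attribute on each cell rather than the cell's text content.

Sample HTML: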

'''

<tr data-row="0">
    <th scope="row" class="left" data-append-csv="AlleRi00" data-stat="player" csk="Allen, Ricardo">
        <a href="/players/A/AlleRi00.htm">Ricardo Allen</a>
    </th>
    <td class="center poptip out dnp" data-stat="week_4" data-tip="Out: Concussion" csk="4">
        <strong>O</strong>
    </td>
</tr>

'''

To scrape the table, I use the following code:

'''

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')

final_data = []
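for tr in table_rows:
    cells = tr.find_all(['th', 'td'])
    # Collects only the visible text of each cell, e.g. "O"
    row = [cell.text for cell in cells]
    final_data.append(row)
# The first row holds the headers; the remaining rows are the data
df = pd.DataFrame(final_data[1:], columns=final_data[0])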

'''

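With my current code I get a good-looking DataFrame containing the headers and all the information that is visible when viewing the table. However, I would like the table to contain "Out: Concussion" instead of "O". I have tried many approaches but cannot figure it out. Please let me know whether this is possible within my current process, or whether I am going about it entirely the wrong way.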


1 answer
User
#1 · Posted 2024-10-04 05:23:01

This should help:

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/teams/atl/2017_injuries.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', attrs={'class': 'sortable', 'id': 'team_injuries'})
table_rows = table.find_all('tr')

final_data = []
for tr in table_rows:
    cells = tr.find_all(['th', 'td'])
    # Prefer the data-tip attribute (e.g. "Out: Concussion"); fall back to the visible text
    row = [cell['data-tip'] if cell.has_attr('data-tip') else cell.text for cell in cells]
    final_data.append(row)

# The first row holds the column headers; the remaining rows are the data
df = pd.DataFrame(final_data[1:], columns=final_data[0])

df.to_csv("D:\\injuries.csv", index = False)

[Screenshot of the resulting CSV file, lightly formatted by the answerer to look tidy; image not available]
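
The attribute-or-text fallback can also be tried offline, without requesting the live page. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up two-row table modeled on the markup in the question, `cell_value` and `extract_rows` are hypothetical helper names, and the built-in `html.parser` is assumed in place of `lxml` so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's markup (not the real page).
SAMPLE_HTML = """
<table id="team_injuries">
  <tr>
    <th data-stat="player">Player</th>
    <th data-stat="week_4">Week 4</th>
  </tr>
  <tr data-row="0">
    <th scope="row" data-stat="player"><a href="/players/A/AlleRi00.htm">Ricardo Allen</a></th>
    <td data-stat="week_4" data-tip="Out: Concussion"><strong>O</strong></td>
  </tr>
</table>
"""


def cell_value(cell):
    """Return the cell's data-tip attribute if it has one, else its stripped text."""
    tip = cell.get("data-tip")  # Tag.get returns None when the attribute is absent
    return tip if tip is not None else cell.get_text(strip=True)


def extract_rows(html):
    """Return one list of cell values per <tr> in the team_injuries table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="team_injuries")
    return [
        [cell_value(cell) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Here `cell.get("data-tip")` plays the same role as the `has_attr` check in the answer. The first list in `rows` is the header row, so it could be passed as `columns=` to `pd.DataFrame` with the remaining rows as data.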
