Beautiful Soup: data values do not match the headers

Posted 2024-10-16 20:51:53


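I am new to Python and am working on a learning project in which I try to scrape some data about college football players. The page source looks like this: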

</thead>
   <tbody>


   <tr>
      <th scope="row" class="right " data-stat="year_id"><a href="/cfb/years/1957.html">1957</a></th>
      <td class="left " data-stat="school_name" csk="San Jose State.1957"><a href="/cfb/schools/san-jose-state/1957.html">San Jose State</a></td>
      <td class="left " data-stat="conf_abbr"><a href="/cfb/conferences/independent/1957.html">Ind</a></td>
      <td class="center " data-stat="class"></td>
      <td class="center " data-stat="pos">RB</td>
      <td class="right " data-stat="g">10</td>
      <td class="right " data-stat="rec">1</td>
      <td class="right " data-stat="rec_yds">6</td>
      <td class="right " data-stat="rec_yds_per_rec">6.0</td>
      <td class="right " data-stat="rec_td">0</td>
      <td class="right " data-stat="rush_att">1</td>
      <td class="right " data-stat="rush_yds">3</td>
      <td class="right " data-stat="rush_yds_per_att">3.0</td>
      <td class="right " data-stat="rush_td">0</td>
      <td class="right " data-stat="scrim_att">2</td>
      <td class="right " data-stat="scrim_yds">9</td>
      <td class="right " data-stat="scrim_yds_per_att">4.5</td>
      <td class="right " data-stat="scrim_td">0</td>
   </tr>

Here is how far I have got with the code:

headers = [item["data-stat"] for item in soup.find_all(attrs={"data-stat" : True})]
cellStrings = [cell.find(text = True) for cell in soup.findAll('td')]
print headers, cellStrings

This prints the following:

[u'', u'header_receiving', u'header_rushing', u'header_scrimmage', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td'] [u'San Jose State', u'Ind', None, u'RB', u'10', u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0', u'San Jose State', None, None, None, None, u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0']

The problem is that some of the headers appear earlier in the source, so the two lists (data and headers) do not line up.
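To make the mismatch concrete, here is a minimal sketch on a hypothetical cut-down table (`SAMPLE_HTML` is made up for illustration and is not the real page). It assumes Beautiful Soup 4 with the built-in `html.parser`. Selecting by the `data-stat` attribute matches the header `<th>` cells and the row's `<th>` as well as the `<td>` cells, whereas selecting `td` alone skips the year, which sits in a `<th>`:

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down table: one header row and one data row.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th data-stat="year_id">Year</th><th data-stat="pos">Pos</th><th data-stat="g">G</th></tr>
  </thead>
  <tbody>
    <tr><th scope="row" data-stat="year_id">1957</th><td data-stat="pos">RB</td><td data-stat="g">10</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every element that has a data-stat attribute: header <th>, row <th>, and <td>.
headers = [el["data-stat"] for el in soup.find_all(attrs={"data-stat": True})]

# Only <td> elements: the year is stored in a <th>, so it is missing here.
cell_strings = [cell.get_text(strip=True) for cell in soup.find_all("td")]

print(len(headers), headers)
print(len(cell_strings), cell_strings)
```

On this sample the first list has six entries and the second only two, so pairing them by position cannot work.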

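My question is: how can I extract each data-stat together with its associated value, instead of extracting the two separately? Ideally I would end up with a dictionary.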


1 answer
User
#1 · Posted 2024-10-16 20:51:53

If I understand correctly, you want a dictionary of the form {'data-stat value': 'text of the td'}. You can build it like this:

# soup is the BeautifulSoup object built from the page source
data_stats = {e['data-stat']: e.get_text().strip()
              for e in soup.find_all(attrs={'data-stat': True})}

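This way, each value is taken from the same element that carries the data-stat attribute, so the names and the values cannot get out of step.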
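One caveat: dictionary keys are unique, so a single dictionary built over the whole page keeps only the last value seen for each data-stat, and header cells that carry the attribute are included too. If the page has several rows, one option is a dictionary per row. The following is only a sketch under assumptions: `SAMPLE_HTML` is a hypothetical cut-down table (the 1958 row is invented for illustration), `rows_as_dicts` is a helper name introduced here, and the parser is the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row is made up to show multi-row handling.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th data-stat="year_id">Year</th><th data-stat="school_name">School</th><th data-stat="pos">Pos</th></tr>
  </thead>
  <tbody>
    <tr><th scope="row" data-stat="year_id"><a href="/cfb/years/1957.html">1957</a></th><td data-stat="school_name"><a href="/cfb/schools/san-jose-state/1957.html">San Jose State</a></td><td data-stat="pos">RB</td></tr>
    <tr><th scope="row" data-stat="year_id">1958</th><td data-stat="school_name">San Jose State</td><td data-stat="pos"></td></tr>
  </tbody>
</table>
"""


def rows_as_dicts(soup):
    """Return one {data-stat: text} dictionary per <tr> inside any <tbody>."""
    rows = []
    for tbody in soup.find_all("tbody"):
        for tr in tbody.find_all("tr"):
            # Take both <th> and <td>, because the year lives in a <th>.
            cells = tr.find_all(["th", "td"], attrs={"data-stat": True})
            rows.append({c["data-stat"]: c.get_text(strip=True) for c in cells})
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = rows_as_dicts(soup)
for row in rows:
    print(row)
```

Restricting the search to `<tbody>` leaves the header row out, and an empty cell simply becomes an empty string instead of `None`.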
