<ul>
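<li>This site uses well-defined table tags, so the simplest solution is <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html" rel="nofollow noreferrer"><code>pandas.read_html</code></a>, which scrapes every table on the page into a list of DataFrames.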
<ul>
<li><code>.read_html()</code> will not work if the HTML contains no table tags.</li>
</ul>
</li>
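<li>Because the tables are read in correctly this way, there are no stray escape codes to strip or remove. If a column did need that kind of cleanup, something like <code>df.Name = df.Name.str.strip()</code> or <code>df.Name = df.Name.str.replace('\r', '')</code> would handle it.</li>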
<li>The benefit of this approach is that it reduces the code to two lines, and the data is easier to manipulate, analyze, and print.</li>
</ul>
<pre class="lang-py prettyprint-override"><code>import pandas as pd
url = 'https://www.hubertiming.com/results/2018MLK'
# read the tables
df_list = pd.read_html(url)
# in this case the desired dataframe is at index 1
df = df_list[1]
# display(df.head())
#    Place   Bib                     Name  Gender   Age        City  State  Chip Time  Chip Pace  Gender Place  Age Group  Age Group Place  Time to Start  Gun Time
# 0      1  1191             MAX RANDOLPH       M  29.0  WASHINGTON     DC      16:48       5:25       1 of 78    M 21-39          1 of 33           0:08     16:56
# 1      2  1080  NEED NAME KAISER RUNNER       M  25.0    PORTLAND     OR      17:31       5:39       2 of 78    M 21-39          2 of 33           0:09     17:40
# 2      3  1275               DAN FRANEK       M  52.0    PORTLAND     OR      18:15       5:53       3 of 78    M 40-54          1 of 27           0:07     18:22
# 3      4  1223              PAUL TAYLOR       M  54.0    PORTLAND     OR      18:31       5:58       4 of 78    M 40-54          2 of 27           0:07     18:38
# 4      5  1245              THEO KINMAN       M  22.0         NaN    NaN      19:31       6:17       5 of 78    M 21-39          3 of 33           0:09     19:40
# output the dataframe as an array; the values in the last two rows contain no escape codes
data = df.to_numpy()
print(repr(data[-2:]))
# [out]:
# array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
#         '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
#         '1:33:53'],
#        [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
#         '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
#       dtype=object)
</code></pre>
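<p>As a rough sketch of the two follow-up steps mentioned above (choosing the right table from the list, and stripping escape codes from a text column if any turn up), something like the following may help. The helpers <code>pick_table</code> and <code>clean_text_column</code> are hypothetical names made up for illustration, not part of pandas, and the two small DataFrames are invented stand-ins for what <code>pd.read_html()</code> would return, so the example runs without network access.</p>

```python
import pandas as pd


def pick_table(df_list, required_columns):
    """Return the first DataFrame that contains every required column.

    Hypothetical helper: an alternative to hard-coding df_list[1].
    """
    for candidate in df_list:
        if set(required_columns).issubset(candidate.columns):
            return candidate
    raise ValueError(f"no table contains columns {sorted(required_columns)}")


def clean_text_column(series):
    """Replace runs of carriage returns/newlines with a space, then trim."""
    return series.str.replace(r"[\r\n]+", " ", regex=True).str.strip()


# Invented stand-ins for the list that pd.read_html(url) would return.
summary = pd.DataFrame({"Finishers": [191]})
results = pd.DataFrame(
    {"Place": [1, 2], "Name": ["  MAX RANDOLPH\r\n", "DAN\r\nFRANEK "]}
)

df = pick_table([summary, results], ["Place", "Name"])
df = df.assign(Name=clean_text_column(df["Name"]))
print(df["Name"].tolist())
```

<p>Selecting by column names rather than by position is one way to keep the code working if the page later adds or reorders tables; whether that is worth the extra lines depends on how stable the page is.</p>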