Answers to: How to remove escape codes from strings after scraping a website


1 answer

Anonymous, 1 day ago

<ul>
<li>This website has well-defined table tags, so the simplest solution is to use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html" rel="nofollow noreferrer"><code>pandas.read_html()</code></a>, which scrapes every table on the page into a list of dataframes.
<ul>
<li>If the HTML contains no table tags, <code>.read_html()</code> will not work.</li>
</ul>
</li>
<li>Because the table is read correctly this way, there are no extra escape codes to strip or remove. If a column of data did need that cleanup, something like <code>df.Name = df.Name.str.strip()</code> or <code>df.Name = df.Name.str.replace('\r', '')</code> would do it.</li>
<li>This has the benefit of reducing the code to two lines, and the data will be easier to manipulate, analyze, and print.</li>
</ul>
<pre class="lang-py prettyprint-override"><code>import pandas as pd

url = 'https://www.hubertiming.com/results/2018MLK'

# read the tables
df_list = pd.read_html(url)

# in this case the desired dataframe is at index 1
df = df_list[1]

# display(df.head())
   Place   Bib                     Name  Gender   Age        City  State  Chip Time  Chip Pace  Gender Place  Age Group  Age Group Place  Time to Start  Gun Time
0      1  1191             MAX RANDOLPH       M  29.0  WASHINGTON     DC      16:48       5:25       1 of 78    M 21-39          1 of 33           0:08     16:56
1      2  1080  NEED NAME KAISER RUNNER       M  25.0    PORTLAND     OR      17:31       5:39       2 of 78    M 21-39          2 of 33           0:09     17:40
2      3  1275               DAN FRANEK       M  52.0    PORTLAND     OR      18:15       5:53       3 of 78    M 40-54          1 of 27           0:07     18:22
3      4  1223              PAUL TAYLOR       M  54.0    PORTLAND     OR      18:31       5:58       4 of 78    M 40-54          2 of 27           0:07     18:38
4      5  1245              THEO KINMAN       M  22.0         NaN    NaN      19:31       6:17       5 of 78    M 21-39          3 of 33           0:09     19:40

# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])

[out]:
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']], dtype=object)
</code></pre>
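If a scraped text column does still carry stray escape characters, the column-level cleanup mentioned in the answer could look roughly like the sketch below. It is only an illustration: the sample values are made up to stand in for scraped data, and `clean_text_column` is a hypothetical helper name, not something from the original answer or from pandas.

```python
import pandas as pd

# Made-up sample values standing in for a scraped "Name" column (assumption)
df = pd.DataFrame(
    {"Name": ["\r\n MAX RANDOLPH\r\n", "DAN\tFRANEK ", "PAUL TAYLOR"]}
)


def clean_text_column(series: pd.Series) -> pd.Series:
    """Hypothetical helper: remove escape characters from a string column."""
    # Collapse every run of whitespace (\r, \n, \t, spaces) into one space,
    # then trim whatever whitespace is left at either end.
    return series.str.replace(r"\s+", " ", regex=True).str.strip()


df["Name"] = clean_text_column(df["Name"])
print(df["Name"].tolist())
```

Collapsing whitespace with a regular expression before calling `.str.strip()` also handles escape characters in the middle of a value, which `.str.strip()` alone would leave in place.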
