How to remove escape codes from strings after scraping a website

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.hubertiming.com/results/2018MLK"

# Open the link and parse the page
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

title = soup.title
print(title)
print(title.text)

# Print every link on the page
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])

# Collect the text of each table cell, row by row
data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)

data = data[4:]
print(data[-2:])

[['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]

3 Answers

User

Answer 1 · edited 2024-06-26 00:18:46

You can simply do this: change cell.text to cell.text.strip() in your code, as shown below:

...
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
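
For anyone who wants to try this without hitting the live site, here is a minimal sketch of the same loop run against an inline HTML string. The one-row table is a hypothetical stand-in that imitates the whitespace from the question, and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table imitating the whitespace seen in the question
html = (
    "<table><tr>"
    "<td>190</td>"
    "<td>\r\n\r\n LEESHA POSEY\r\n\r\n </td>"
    "<td>\r\n\r\n 112 of 113\r\n\r\n </td>"
    "</tr></table>"
)

# html.parser ships with Python, so lxml is not needed for this sketch
soup = BeautifulSoup(html, "html.parser")

data = []
for row in soup.find_all("tr"):
    # strip() trims the leading/trailing \r, \n and spaces from each cell
    data.append([cell.text.strip() for cell in row.find_all("td")])

print(data)
```

Cells that had no surrounding whitespace, such as '190', pass through unchanged.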

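User

Answer 2 · edited 2024-06-26 00:18:46

This site has well-defined table tags, so the simplest solution is pandas.read_html(), which scrapes every table on the page into a list of DataFrames.
- .read_html() will not work if the HTML contains no table tags.
Because the tables are read correctly this way, there are no extra escape codes to strip or remove. If a column did still contain them, something like df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') would take care of it.
The benefit is that the code shrinks to two lines, and the data is easier to manipulate, analyze, and print.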

import pandas as pd

url = 'https://www.hubertiming.com/results/2018MLK'

# read the tables
df_list = pd.read_html(url)

# in this case the desired dataframe is at index 1
df = df_list[1]

# display(df.head())
   Place   Bib                     Name Gender   Age        City State Chip Time Chip Pace Gender Place Age Group Age Group Place Time to Start Gun Time
0      1  1191             MAX RANDOLPH      M  29.0  WASHINGTON    DC     16:48      5:25      1 of 78   M 21-39         1 of 33          0:08    16:56
1      2  1080  NEED NAME KAISER RUNNER      M  25.0    PORTLAND    OR     17:31      5:39      2 of 78   M 21-39         2 of 33          0:09    17:40
2      3  1275               DAN FRANEK      M  52.0    PORTLAND    OR     18:15      5:53      3 of 78   M 40-54         1 of 27          0:07    18:22
3      4  1223              PAUL TAYLOR      M  54.0    PORTLAND    OR     18:31      5:58      4 of 78   M 40-54         2 of 27          0:07    18:38
4      5  1245              THEO KINMAN      M  22.0         NaN   NaN     19:31      6:17      5 of 78   M 21-39         3 of 33          0:09    19:40

# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])
[out]: 
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
        '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
        '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
        '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
      dtype=object)
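
The column-level cleanup mentioned in this answer can be sketched as follows. The small DataFrame here is hypothetical: it is built by hand to carry the whitespace from the question, since read_html already returns clean values for this page.

```python
import pandas as pd

# Hypothetical frame whose Name column still carries the scraped whitespace
df = pd.DataFrame({
    "Bib": [2087, 1216],
    "Name": ["\r\n\r\n LEESHA POSEY\r\n\r\n ", "\r\n\r\n ZULMA OCHOA\r\n\r\n "],
})

# .str.strip() trims leading/trailing whitespace on every value in the column
df["Name"] = df["Name"].str.strip()

print(df["Name"].tolist())
```

Bracket access (df["Name"]) is used instead of df.Name only as a habit; both refer to the same column here.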

User

Answer 3 · edited 2024-06-26 00:18:46

You have a 2D list.

What we are using:

- List comprehension
- The strip() method

That's it :)

Use the following code:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)

Output:

[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
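
One caveat worth noting: strip() only trims the ends of a string. The data in the question has whitespace only around each value, so strip() is enough there. If a cell ever had a line break inside the value, a split-and-join would be one way to collapse it. The cell below is a hypothetical example, not taken from the site.

```python
# Hypothetical cell with a line break inside the value, not just around it
cell = "\r\n LEESHA\r\n POSEY \r\n"

# strip() only trims the ends, so the inner \r\n survives
stripped = cell.strip()
print(repr(stripped))

# split() with no arguments splits on any run of whitespace;
# joining with a single space rebuilds a clean one-line value
normalized = " ".join(cell.split())
print(repr(normalized))
```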
