How to remove escape codes from strings after scraping a website

Posted 2024-06-16 11:51:04


I'm trying to learn data science with Python on SimpleArn. In the matplotlib section of the course, they scrape a web page as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.hubertiming.com/results/2018MLK"  # open link
html = urlopen(url)  # the variable is lowercase `url`; `URL` would raise a NameError
soup=BeautifulSoup(html,"lxml")
title = soup.title
print (title)
print(title.text)
links = soup.find_all('a',href=True)
for link in links:
    print (link['href'])
data =[]
allrows=soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
data=data[4:]
print(data[-2:])

This is the result:

[['190', '2087', '\r\n\r\n                    LEESHA POSEY\r\n\r\n                ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n                    112 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    36 of 37\r\n\r\n                ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n                    ZULMA OCHOA\r\n\r\n                ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n                    113 of 113\r\n\r\n                ', 'F 40-54', '\r\n\r\n                    37 of 37\r\n\r\n                ', '0:00', '1:43:27']]

How can I get rid of the \r\n\r\n? I tried the replace function, but it fails with "'list' object has no attribute 'replace'", and I can't use strip either.


3 answers

You can simply do this: change cell.text in your code to cell.text.strip(), as shown below:

...
for row in allrows:
    row_list = row.find_all("td")
    dataRow=[]
    data_converted = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
...
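
As a minimal, offline sketch of the same idea: the snippet below parses a small hand-written HTML fragment instead of the live page. The fragment, the variable names, and the use of the built-in "html.parser" are assumptions made for illustration; only cell.text.strip() comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: cell text is padded
# with newlines and spaces, much like the real results table.
html = """
<table>
  <tr><td>190</td><td>

        LEESHA POSEY

  </td><td>F</td></tr>
  <tr><td>191</td><td>

        ZULMA OCHOA

  </td><td>F</td></tr>
</table>
"""

# "html.parser" ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, "html.parser")

data = []
for row in soup.find_all("tr"):
    # strip() is called on each cell's string, not on the list of cells.
    data.append([cell.text.strip() for cell in row.find_all("td")])

print(data)
```

For cells that contain a single piece of text, cell.get_text(strip=True) should give the same result as cell.text.strip().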
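  • This website has well-defined table tags, so the simplest solution is to use pandas.read_html(), which scrapes all tables into a list of DataFrames.
    • If the HTML contains no table tags, .read_html() will not work.
  • Because the tables are read correctly this way, there are no extra escape codes to strip or remove. If a column does still contain them, code like df.Name = df.Name.str.strip() or df.Name = df.Name.str.replace('\r', '') will remove them.
  • The benefit is that the code shrinks to two lines, and the data is easier to manipulate, analyze, and print.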
import pandas as pd

url = 'https://www.hubertiming.com/results/2018MLK'

# read the tables
df_list = pd.read_html(url)

# in this case the desired dataframe is at index 1
df = df_list[1]

# display(df.head())
   Place   Bib                     Name Gender   Age        City State Chip Time Chip Pace Gender Place Age Group Age Group Place Time to Start Gun Time
0      1  1191             MAX RANDOLPH      M  29.0  WASHINGTON    DC     16:48      5:25      1 of 78   M 21-39         1 of 33          0:08    16:56
1      2  1080  NEED NAME KAISER RUNNER      M  25.0    PORTLAND    OR     17:31      5:39      2 of 78   M 21-39         2 of 33          0:09    17:40
2      3  1275               DAN FRANEK      M  52.0    PORTLAND    OR     18:15      5:53      3 of 78   M 40-54         1 of 27          0:07    18:22
3      4  1223              PAUL TAYLOR      M  54.0    PORTLAND    OR     18:31      5:58      4 of 78   M 40-54         2 of 27          0:07    18:38
4      5  1245              THEO KINMAN      M  22.0         NaN   NaN     19:31      6:17      5 of 78   M 21-39         3 of 33          0:09    19:40

# output the dataframe as an array, and see the values in the last two lists have no escape codes
data = df.to_numpy()
print(data[-2:])
[out]: 
array([[190, 2087, 'LEESHA POSEY', 'F', 43.0, 'PORTLAND', 'OR',
        '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00',
        '1:33:53'],
       [191, 1216, 'ZULMA OCHOA', 'F', 40.0, 'GRESHAM', 'OR', '1:43:27',
        '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']],
      dtype=object)
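
If a column does end up with stray whitespace or escape codes, the .str accessor mentioned above can clean it. The following is a small sketch on a hand-built DataFrame; the sample values and the "Gender Place" column contents are made up for illustration and are not read from the website.

```python
import pandas as pd

# Hypothetical frame that still carries the escape codes from the question.
df = pd.DataFrame({
    "Name": ["\r\n\r\n   LEESHA POSEY\r\n\r\n  ", "\r\n\r\n   ZULMA OCHOA\r\n\r\n  "],
    "Gender Place": ["\r\n 112 of 113\r\n ", "\r\n 113 of 113\r\n "],
})

# str.strip() removes leading and trailing whitespace, including \r and \n.
df["Name"] = df["Name"].str.strip()

# str.replace() removes a specific substring anywhere in the value;
# regex=False treats the pattern as a literal string.
df["Gender Place"] = (
    df["Gender Place"]
    .str.replace("\r", "", regex=False)
    .str.replace("\n", "", regex=False)
    .str.strip()
)

print(df)
```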

You have a 2D list.

What we are using:
  1. A list comprehension
  2. The strip() method
  3. That's it :)

Use the following code:

text = [['190', '2087', '\r\n\r\n LEESHA POSEY\r\n\r\n ', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '\r\n\r\n 112 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 36 of 37\r\n\r\n ', '0:00', '1:33:53'], ['191', '1216', '\r\n\r\n ZULMA OCHOA\r\n\r\n ', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '\r\n\r\n 113 of 113\r\n\r\n ', 'F 40-54', '\r\n\r\n 37 of 37\r\n\r\n ', '0:00', '1:43:27']]
result = [[j.strip() for j in i] for i in text]
print(result)

Output:

[['190', '2087', 'LEESHA POSEY', 'F', '43', 'PORTLAND', 'OR', '1:33:53', '30:17', '112 of 113', 'F 40-54', '36 of 37', '0:00', '1:33:53'], ['191', '1216', 'ZULMA OCHOA', 'F', '40', 'GRESHAM', 'OR', '1:43:27', '33:22', '113 of 113', 'F 40-54', '37 of 37', '0:00', '1:43:27']]
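
To show where the error in the question comes from, here is a short sketch using a shortened, hypothetical row. replace() and strip() are string methods, so they have to be called on each element rather than on the list that holds them.

```python
# Shortened, hypothetical row in the same shape as the scraped data.
row = ['190', '\r\n\r\n   LEESHA POSEY\r\n\r\n  ', 'F']

# Calling replace() on the list itself reproduces the error from the question.
try:
    row.replace('\r\n', '')
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# Calling the string method on each element works.
cleaned = [cell.strip() for cell in row]

# Alternative: split() with no arguments splits on any run of whitespace,
# so joining the pieces also collapses whitespace inside the value.
collapsed = [" ".join(cell.split()) for cell in row]

print(cleaned)
print(collapsed)
```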
