How to split HTML into a DataFrame

Posted 2024-05-18 06:52:53


from bs4 import BeautifulSoup
import re
import pandas as pd
import os

# driver is assumed to be a Selenium WebDriver that has already loaded the page
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')

After importing some HTML (for example http://www.espncricinfo.com/series/18886/scorecard/1157372/), I realised that what should be a table is not actually a table, so it seems I need to rebuild the table myself.

Batsmen = soup_level1.find_all('div',class_="cell batsmen")
pd.Series(Batsmen)

0     <div class="cell batsmen" data-reactid="182">B...
1     <div class="cell batsmen" data-reactid="191"><...
...
18    <div class="cell batsmen" data-reactid="541"><...
dtype: object

I can extract a batsman's name like this:

FirstBat = Batsmen[1]
FirstBat = str(FirstBat)
FirstBat = pd.Series(FirstBat)
FirstBat = FirstBat.str.split(pat = ">",expand=True)
FirstBat = FirstBat[2]
FirstBat

0    S Dhawan</a
Name: 2, dtype: object
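The string split above leaves a trailing `</a` on the name. As a minimal sketch of an alternative, BeautifulSoup's `get_text` returns only the text inside a tag. The HTML fragment below is a hypothetical stand-in modelled on the output shown above (the `href` value is made up), so the example runs without the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the markup shown above; the href is invented.
html = '<div class="cell batsmen" data-reactid="191"><a href="#">S Dhawan</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Find the div by its full class string, then take only its text content.
first_bat = soup.find("div", class_="cell batsmen")
name = first_bat.get_text(strip=True)
print(name)  # S Dhawan
```

The same call could be applied to every element returned by `find_all` to collect all the names in one list.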

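I then want to join each batsman's name to his stats, but entries 0 to 4 of the stats are the column headers and entries 5 to 9 belong to the first batsman: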

Stats = soup_level1.find_all('div',class_="cell runs")
pd.Series(Stats)

0     <div class="cell runs" data-reactid="184">R</div>
1     <div class="cell runs" data-reactid="185">B</div>
2     <div class="cell runs" data-reactid="186">4s</...
3     <div class="cell runs" data-reactid="187">6s</...
4     <div class="cell runs" data-reactid="188">SR</...
5     <div class="cell runs" data-reactid="194">4</div>
6     <div class="cell runs" data-reactid="195">8</div>
7     <div class="cell runs" data-reactid="196">1</div>
8     <div class="cell runs" data-reactid="197">0</div>
9     <div class="cell runs" data-reactid="198">50.0... 
...
94    <div class="cell runs" data-reactid="548">-</div>
Length: 95, dtype: object

What is the best way to extract this so that the result looks like the following?

    Batsmen  R  B  4s  6s    SR
0  S Dhawan  4  8   1   0  50.0

1 answer
User
Reply #1 · posted 2024-05-18 06:52:53

To get you started:

from bs4 import BeautifulSoup
import numpy as np
import requests

html_doc= requests.get(r'http://www.espncricinfo.com/series/18886/scorecard/1157372/').content
soup = BeautifulSoup(html_doc, 'html.parser')

data = []
for div in soup.find_all('div',class_="cell runs"):
    data.append(div.text)

np.array(data).reshape(-1,5)

which outputs:

array([['R', 'B', '4s', '6s', 'SR'],
       ['14', '12', '2', '0', '116.66'],
       ['68', '55', '5', '1', '123.63'],
       ['39', '30', '2', '2', '130.00'],
       ['2', '3', '0', '0', '66.66'],
       ['9', '6', '1', '0', '150.00'],
       ['0', '1', '0', '0', '0.00'],
       ['0', '2', '0', '0', '0.00'],
       ['1', '2', '0', '0', '50.00'],
       ['0', '1', '0', '0', '0.00'],
       ['17', '8', '2', '1', '212.50'],
       ['R', 'B', '4s', '6s', 'SR'],
       ['0', '3', '0', '0', '0.00'],
       ['4', '2', '1', '0', '200.00'],
       ['14', '15', '2', '0', '93.33'],
       ['2', '7', '0', '0', '28.57'],
       ['0', '2', '0', '0', '0.00'],
       ['1', '3', '0', '0', '33.33'],
       ['19', '23', '2', '0', '82.60'],
       ['34', '29', '6', '0', '117.24'],
       ['3', '7', '0', '0', '42.85'],
       ['6', '3', '1', '0', '200.00'],
       ['2', '7', '0', '0', '28.57']], dtype='<U6')

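From here it is not hard to put this into a DataFrame. You do have to be careful, because this reads more than one table: you can see a second "header" row (i.e. ['R', 'B', '4s', '6s', 'SR']) partway through the array, so you need to decide how to handle it.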
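One possible way to handle it, as a rough sketch rather than a definitive solution: use the first row as the column names and drop any later row that repeats the header. The rows below are a small subset copied from the array above, and converting the cells to numbers is my own addition:

```python
import numpy as np
import pandas as pd

# A few rows copied from the array above, including the repeated header row.
rows = np.array([
    ['R', 'B', '4s', '6s', 'SR'],
    ['14', '12', '2', '0', '116.66'],
    ['68', '55', '5', '1', '123.63'],
    ['R', 'B', '4s', '6s', 'SR'],
    ['0', '3', '0', '0', '0.00'],
])

# Use the first row as column names and the rest as data.
header = rows[0].tolist()
df = pd.DataFrame(rows[1:], columns=header)

# Drop later rows that just repeat the header (the start of the next table).
df = df[df["R"] != "R"].reset_index(drop=True)

# Convert text cells to numbers; non-numeric cells such as '-' become NaN.
df = df.apply(pd.to_numeric, errors="coerce")
print(df)
```

If the `cell batsmen` divs line up one-to-one with these rows, as the counts in the question suggest (19 names and 95 / 5 = 19 stat rows), the names could be added as an extra column before the header rows are filtered out. That alignment is an assumption and is worth checking against the page.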
