Parsing a table with BeautifulSoup in Python

Posted 2024-09-29 19:25:09


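I want to iterate over each row and capture td.text. The problem is that the table has no class, and all of the td elements share the same class name. I would like to walk through each row and get the following output: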

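Row 1) "AMERICANS SOCCER CLUB","B11EB - AMERICANS-B11EB-WARZALA","Cameron Coya","Player 228004","2016-09-10","player persistently infringes the laws of the game","C" (new line)

Row 2) "AVIATORS SOCCER CLUB","G12DB - AVIATORS-G12DB-REYNGOUDT","Saskia Reyes","Player 224463","2016-09-11","player/sub guilty of unsporting behavior","C" (new line)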

<div style="overflow:auto; border:1px #cccccc solid;">
<table cellspacing="0" cellpadding="3" align="left" border="0" width="100%">
    <tbody>
        <tr class="tblHeading">
            <td colspan="7">AMERICANS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11EB - AMERICANS-B11EB-WARZALA</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Cameron Coya                                       </td>
            <td width="19%" class="tdUnderLine">
                Rozel, Max
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         
                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=228004" target="_blank">228004</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/10/16 02:15 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">AVIATORS SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">G12DB - AVIATORS-G12DB-REYNGOUDT</td> 
        </tr>
        <tr bgcolor="#FBFBFB">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Saskia Reyes                                       </td>
            <td width="19%" class="tdUnderLine">
                HollaenderNardelli, Eric
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-11-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=224463" target="_blank">224463</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 
                09/11/16 06:45 PM   
            </td>
            <td width="30%" class="tdUnderLine">                player/sub guilty of unsporting behavior     </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>
        <tr class="tblHeading">
            <td colspan="7">BERGENFIELD SOCCER CLUB</td>
        </tr>
        <tr bgcolor="#CCE4F1">
            <td colspan="7">B11CW - BERGENFIELD-B11CW-NARVAEZ</td> 
        </tr>
        <tr bgcolor="#FFFFFF">
            <td width="19%" class="tdUnderLine"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Christian Latorre                                  </td>
            <td width="19%" class="tdUnderLine">
                Coyle, Kevin
            </td>
            <td width="06%" class="tdUnderLine"> 
            09-10-2016
            </td>
            <td width="05%" class="tdUnderLine" align="center">         

                <a href="http://www.ncsanj.com/gameRefReportPrint.cfm?gid=226294" target="_blank">226294</a>    
            </td>
            <td width="16%" class="tdUnderLine" align="center"> 

                09/10/16 11:00 AM   

            </td>
            <td width="30%" class="tdUnderLine">                player persistently infringes the laws of the game   </td>
            <td class="tdUnderLine">                Cautioned    </td>
        </tr>

I tried the following code.

^{pr2}$

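But obviously it returns the text from every td, so I cannot tell which column a value belongs to or where a new record begins. I would like to know:

1) How to identify each column (since the class names are all the same) as well as the headings (I would appreciate it if you could provide code)

2) How to recognize the start of a new record in this kind of structure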


3 answers
from __future__ import print_function
import re
import datetime
from bs4 import BeautifulSoup

# Load the saved page.
with open("/tmp/a.html") as page:
    soup = BeautifulSoup(page.read(), "html.parser")

# The table has no class or id, so locate it through the style of its wrapping div.
table = soup.find('div', {'style': 'overflow:auto; border:1px #cccccc solid;'}).find('table')

trs = table.find_all('tr')

game = ""
section = ""

for tr in trs:
    # Rows with a class attribute (tblHeading) carry the club name.
    if tr.has_attr('class'):
        game = tr.text.strip()
    if tr.has_attr('bgcolor'):
        if tr['bgcolor'] == '#CCE4F1':
            # Rows with this background colour carry the team name.
            section = tr.text.strip()
        else:
            # Any other bgcolor marks a data row, i.e. the start of a new record.
            tds = tr.find_all('td')
            # Drop non-ASCII characters such as the &nbsp; padding.
            extracted_text = [re.sub(r'([^\x00-\x7F])+', '', td.text) for td in tds]
            extracted_text = [x.strip() for x in extracted_text]
            extracted_text = list(filter(lambda x: len(x) > 2, extracted_text))
            extracted_text.pop(1)  # discard the second cell (e.g. "Rozel, Max")
            extracted_text[2] = "Player " + extracted_text[2]
            # Convert e.g. "09/10/16 02:15 PM" to "2016-09-10".
            extracted_text[3] = datetime.datetime.strptime(extracted_text[3], '%m/%d/%y %I:%M %p').strftime("%Y-%m-%d")
            extracted_text = ['"' + x + '"' for x in [game, section] + extracted_text]
            print(','.join(extracted_text))

When run:

^{pr2}$

Following further discussion with the OP, the input is https://paste.fedoraproject.org/428111/87928814/raw/, and the output after running the code above is: https://paste.fedoraproject.org/428110/38792211/raw/
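The two questions (which column is which, and where a record starts) can also be handled by position and by row attributes. The following is only a sketch under assumptions: the column names in `COLUMNS` are hypothetical labels guessed from the sample values, since the page has no header row, and the rows are fed in as plain `(attrs, cells)` pairs so the logic can be tried without the page.

```python
from datetime import datetime

# Hypothetical column names, assigned purely by cell position.
COLUMNS = ["player", "referee", "report_date", "game_id",
           "game_time", "offence", "sanction"]


def row_kind(attrs):
    """Classify a <tr> from its attribute dict (the shape of bs4's tr.attrs)."""
    if "tblHeading" in attrs.get("class", []):
        return "club"
    if attrs.get("bgcolor", "").upper() == "#CCE4F1":
        return "team"
    return "data"  # every data row starts a new record


def row_to_record(club, team, cells):
    """Turn the seven cell texts of a data row into a dict keyed by column name."""
    # Replace non-breaking spaces (&nbsp;) and trim the padding around each cell.
    cleaned = [c.replace("\xa0", " ").strip() for c in cells]
    if len(cleaned) != len(COLUMNS):
        raise ValueError("expected %d cells, got %d" % (len(COLUMNS), len(cleaned)))
    record = dict(zip(COLUMNS, cleaned))
    record["club"] = club
    record["team"] = team
    parsed = datetime.strptime(record["game_time"], "%m/%d/%y %I:%M %p")
    record["game_date"] = parsed.strftime("%Y-%m-%d")
    return record


# Sample rows copied from the HTML in the question.
rows = [
    ({"class": ["tblHeading"]}, ["AMERICANS SOCCER CLUB"]),
    ({"bgcolor": "#CCE4F1"}, ["B11EB - AMERICANS-B11EB-WARZALA"]),
    ({"bgcolor": "#FFFFFF"}, [
        " \xa0\xa0\xa0 Cameron Coya   ", "\n Rozel, Max\n", " \n09-11-2016\n",
        "\n 228004 \n", " \n 09/10/16 02:15 PM   \n",
        "   player persistently infringes the laws of the game   ",
        "   Cautioned    ",
    ]),
]

records = []
club = team = ""
for attrs, cells in rows:
    kind = row_kind(attrs)
    if kind == "club":
        club = cells[0].strip()
    elif kind == "team":
        team = cells[0].strip()
    else:
        records.append(row_to_record(club, team, cells))

print(records[0]["club"], records[0]["player"], records[0]["game_date"])
```

With BeautifulSoup, the `rows` list could presumably be built as `[(tr.attrs, [td.get_text() for td in tr.find_all("td")]) for tr in table.find_all("tr")]`. Keeping each record as a dict makes it easy to pick only the fields needed for the output.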

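If the data is really structured like a table, there is a good chance you can read it into pandas directly with pd.read_table(). Note that it accepts a URL in its filepath_or_buffer parameter. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html Bear in mind that pd.read_table() expects delimited text; for HTML markup like the sample in the question, pd.read_html() is the pandas function that parses table elements.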

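# Assumes `soup` is the BeautifulSoup object for the page.
count = 0
string = ""
for td in soup.find_all("td"):
    string += "\"" + td.text.strip() + "\","
    count += 1
    # Each record spans 9 td cells: club, team, and 7 data cells.
    if count % 9 == 0:
        print(string[:-1] + "\n\n")  # string[:-1] removes the trailing ","
        string = ""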

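Since the table is not laid out in a convenient way, we can simply iterate over the td elements rather than going row by row and then into each td, which would complicate things. I just used a string here; you could instead append the data to a list of lists and process it later.
Hope this solves your problem.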
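The list-of-lists variant mentioned above might look like the sketch below. It assumes, as the answer does, that every record spans exactly 9 cells; `group_cells` and `to_csv` are hypothetical helper names, and the standard `csv` module is used for the quoting instead of manual string concatenation.

```python
import csv
import io


def group_cells(cells, size=9):
    """Split a flat list of cell texts into records of `size` cells each."""
    if len(cells) % size != 0:
        raise ValueError("cell count %d is not a multiple of %d" % (len(cells), size))
    return [cells[i:i + size] for i in range(0, len(cells), size)]


def to_csv(records):
    """Render records as CSV text with every field quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(records)
    return buf.getvalue()


# Stand-in data: 18 cell texts, i.e. two records of 9 cells.
cells = [str(n) for n in range(18)]
records = group_cells(cells)
csv_text = to_csv(records)
print(csv_text)
```

In practice `cells` would come from something like `[td.text.strip() for td in soup.find_all("td")]`. The length check fails loudly if a row with a different number of cells ever shifts the grouping, which the modulo counter would silently miss.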
