BeautifulSoup span id tags into pandas

Published 2024-06-28 18:51:09


I have the following HTML:

</tr><tr>
<td>
<span id="Grid_exdate_43">2/15/2005</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_43">0.08</span>
</td><td>
<span id="Grid_DeclDate_43">--</span>
</td><td>
<span id="Grid_RecDate_43">2/17/2005</span>
</td><td>
<span id="Grid_PayDate_43">3/10/2005</span>
</td>
</tr><tr>
<td>
<span id="Grid_exdate_44">11/15/2004</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_44">3.08</span>
</td><td>
<span id="Grid_DeclDate_44">--</span>
</td><td>
<span id="Grid_RecDate_44">11/17/2004</span>
</td><td>
<span id="Grid_PayDate_44">12/2/2004</span>
</td>
</tr><tr>

Each section contains the same five items: Grid_exdate, Grid_CashAmount, Grid_DeclDate, Grid_RecDate and Grid_PayDate. Each id ends in an integer that increments from one section to the next; in the example above we have sections 43 and 44.

I need to save each section as a row in a pandas DataFrame, but I'm not sure how to go about it.

Edit:

OK, I've found an approach that works:

# requires: import re, import pandas as pd; soup is the parsed BeautifulSoup document
def get_exdate(self, id):
    # matches any id containing "Grid_exdate_"
    return id and re.compile("Grid_exdate_").search(id)

df = pd.DataFrame()
exdate_list = []
for link in soup.find_all(id=self.get_exdate):
    exdate_list.append(link.string)

df['Grid_exdate'] = exdate_list

So the code above uses a regular expression to find all of the Grid_exdate_ values, appends each result to a list, and then adds the list as a column to the DataFrame.

So I just create five of these, one per field. If anyone has a better solution please let me know (this is probably not a very efficient approach); otherwise this does the job.
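One way to avoid writing five separate search functions is to loop over the five field prefixes. This is a hedged sketch, not the original code; the inline HTML sample and field names are taken from the question:

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

# one section of the question's HTML, inlined for a self-contained example
html = '''
<span id="Grid_exdate_43">2/15/2005</span><span id="Grid_CashAmount_43">0.08</span>
<span id="Grid_DeclDate_43">--</span><span id="Grid_RecDate_43">2/17/2005</span>
<span id="Grid_PayDate_43">3/10/2005</span>
'''
soup = BeautifulSoup(html, 'html.parser')

fields = ['Grid_exdate', 'Grid_CashAmount', 'Grid_DeclDate',
          'Grid_RecDate', 'Grid_PayDate']
df = pd.DataFrame()
for field in fields:
    # BeautifulSoup accepts a compiled regex as an attribute filter
    df[field] = [s.string for s in
                 soup.find_all('span', id=re.compile('^' + field + '_'))]
```

Each iteration builds one column, so the loop produces the same DataFrame as five hand-written searches.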


3 Answers

You can use pandas read_html. From the docs:

This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for “table data”.

So before using the file you need to wrap it in <table> tags:

<table>
your html
</table>
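A hedged Python sketch of adding the wrapper (this is an assumption, not the answer's exact code; the snippet is one row from the question's HTML):

```python
from io import StringIO
import pandas as pd

snippet = '''<tr><td><span id="Grid_exdate_43">2/15/2005</span></td>
<td>Cash</td><td><span id="Grid_CashAmount_43">0.08</span></td></tr>'''
wrapped = '<table>' + snippet + '</table>'

# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(StringIO(wrapped))[0]
```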

Then take the first element, since read_html reads every table in the HTML into a list of DataFrames:

df = pd.read_html('test.html')

Edit

If you want to rename the columns:

df1 = df[0]
df1.columns = ["Grid_exdate", "Cash", "Grid_CashAmount", "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]

You will have a 'Cash' column, because it comes from a separate table cell:

In [494]: df1
Out[494]:
  Grid_exdate  Cash  Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
0   2/15/2005  Cash             0.08                 2/17/2005    3/10/2005
1  11/15/2004  Cash             3.08                11/17/2004    12/2/2004

You can then drop the 'Cash' column, or edit the initial table:

In [496]: df1.drop('Cash', axis=1)
Out[496]:
  Grid_exdate  Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
0   2/15/2005             0.08                 2/17/2005    3/10/2005
1  11/15/2004             3.08                11/17/2004    12/2/2004
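Note that drop returns a new DataFrame rather than modifying df1 in place, so assign the result if you want to keep it. A minimal sketch with made-up stand-in data (values taken from the question):

```python
import pandas as pd

# toy frame standing in for the parsed table
df1 = pd.DataFrame({'Grid_exdate': ['2/15/2005', '11/15/2004'],
                    'Cash': ['Cash', 'Cash'],
                    'Grid_CashAmount': [0.08, 3.08]})

# drop returns a copy; reassign (or pass inplace=True) to keep the change
df1 = df1.drop('Cash', axis=1)
```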

If you don't want to use pandas read_html, you can do more elaborate parsing yourself:

import pandas as pd
from bs4 import BeautifulSoup

table = BeautifulSoup(open('test.html', 'r').read(), 'html.parser')

# generate header from the first tr
h = [[td.span.get('id') for td in row.select('td') if td.span is not None]
     for row in table.find_all('tr')]
# remove empty lists
h = [x for x in h if x != []]
header = h[0]
print(header)
# ['Grid_exdate_43', 'Grid_CashAmount_43', 'Grid_DeclDate_43', 'Grid_RecDate_43', 'Grid_PayDate_43']

# if generating the header is problematic, you can specify it yourself
# header = ['Grid_exdate', 'Grid_CashAmount', 'Grid_DeclDate', 'Grid_RecDate', 'Grid_PayDate']

# get the table content, dropping the td cells whose text is 'Cash'
body = [[td.text.strip() for td in row.select('td') if td.text.strip() != 'Cash']
        for row in table.find_all('tr')]
# remove empty lists
body = [x for x in body if x != []]

cols = zip(*body)

tbl_d = {name: col for name, col in zip(header, cols)}

df = pd.DataFrame(tbl_d, columns=header)
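The generated header still carries the section-number suffix (_43 here), so the columns won't match the plain field names. A small re.sub pass strips it (a hedged addition, not part of the original answer; the header values are assumed from the printed output above):

```python
import re

header = ['Grid_exdate_43', 'Grid_CashAmount_43', 'Grid_DeclDate_43',
          'Grid_RecDate_43', 'Grid_PayDate_43']

# remove a trailing underscore-plus-digits suffix from each id
clean = [re.sub(r'_\d+$', '', h) for h in header]
print(clean)  # ['Grid_exdate', 'Grid_CashAmount', 'Grid_DeclDate', 'Grid_RecDate', 'Grid_PayDate']
```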

Thanks everyone for the solutions. In the end I went with the following, which seems to be the simplest:

# requires: import re, import pandas as pd; soup is the parsed BeautifulSoup document
def get_exdate(self, id):
    # matches any id containing "Grid_exdate_"
    return id and re.compile("Grid_exdate_").search(id)

df = pd.DataFrame()
exdate_list = []
for link in soup.find_all(id=self.get_exdate):
    exdate_list.append(link.string)

df['Grid_exdate'] = exdate_list

This uses re.compile to search the HTML/soup for everything whose id starts with Grid_exdate_, then adds the results to the DataFrame. So I just created one re.compile search per required field and added them all to the DataFrame under the correct column headers.
