Unable to extract some data from a table in a customized manner

Posted 2024-05-19 20:27:32


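I'm trying to parse tabular content out of some HTML elements and arrange it in a customized manner so that I can later write it to a CSV file accordingly.

The table looks almost exactly like this.

The HTML elements look like this (truncated):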

<tr>
    <td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60104"> Complete Care at Linwood, LLC </a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60102">The Health Center At Galloway</a>
    </td>
</tr>

<tr>
    <td align="center" colspan="4" class="header">BERGEN</td>
</tr>

<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>

<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=30201">The Actors Fund Homes</a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJAL02007"> Actors Fund Home, The </a>
    </td>
</tr>

This is what I've tried so far:

for item in soup.select("tr"):
    try:
        header = item.select_one("td.header").text
    except AttributeError:
        header = ""
    try:
        item_name = item.select_one("td > a").text
    except AttributeError:
        item_name = ""
    print(item_name,header)

The output it produces:

ATLANTIC
 
Complete Care at Linwood, LLC  
The Health Center At Galloway 

 BERGEN
 
The Actors' Fund Homes
Actors Fund Home, The 

The output I'd like to get:

Complete Care at Linwood, LLC  ATLANTIC
The Health Center At Galloway  ATLANTIC
The Actors' Fund Homes         BERGEN
Actors Fund Home, The          BERGEN

2 Answers

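This produces the output the way you want. Each `tr` holds either a header cell or a facility link, never both, so remember the most recent header and print it next to every facility row that follows it: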

for item in soup.select("tr"):
    if item.select_one("td.header"):
        header = item.select_one("td.header").text

    elif item.select_one("td > a"):
        item_name = item.select_one("td > a").text
        print(item_name,header)
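
For anyone who wants to run this without the original page, here is a minimal self-contained sketch of the same idea. The function name `pair_facilities_with_headers` is made up for illustration, the markup is a shortened copy of the snippet from the question wrapped in a `<table>`, and I'm assuming the built-in `html.parser` backend so that lxml isn't needed. It also strips the padding around the link text and initialises `header` before the loop, so a facility row that appears before any header row can't raise a `NameError`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question, wrapped in a <table>.
html = """
<table>
<tr><td align="center" colspan="4" class="header">ATLANTIC</td></tr>
<tr><td class="black10bold">Facility</td><td class="black10bold">Type</td></tr>
<tr><td style="width: 55%"><a href="fsFacilityDetails.aspx?item=NJ60104"> Complete Care at Linwood, LLC </a></td></tr>
<tr><td align="center" colspan="4" class="header">BERGEN</td></tr>
<tr><td style="width: 55%"><a href="fsFacilityDetails.aspx?item=NJAL02007"> Actors Fund Home, The </a></td></tr>
</table>
"""


def pair_facilities_with_headers(markup):
    """Return (facility_name, header) tuples (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    header = ""  # stays empty until the first header row is seen
    for tr in soup.select("tr"):
        header_cell = tr.select_one("td.header")
        link = tr.select_one("td > a")
        if header_cell is not None:
            # A header row: remember it for the facility rows that follow.
            header = header_cell.get_text(strip=True)
        elif link is not None:
            # A facility row: pair it with the most recent header.
            rows.append((link.get_text(strip=True), header))
    return rows


rows = pair_facilities_with_headers(html)
for name, header in rows:
    print(name, header)
```

Rows that contain neither a header cell nor a link (the "Facility / Type / Funding" column titles) are simply skipped.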

Hope this helps:

import csv
from bs4 import BeautifulSoup
html = """<tr>
    <td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60104"> Complete Care at Linwood, LLC </a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60102">The Health Center At Galloway</a>
    </td>
</tr>

<tr>
    <td align="center" colspan="4" class="header">BERGEN</td>
</tr>

<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>

<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=30201">The Actors Fund Homes</a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJAL02007"> Actors Fund Home, The </a>
    </td>
</tr>"""



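# Parse the markup and collect the text of every cell, one list per table row
soup = BeautifulSoup(html, 'lxml')
output_rows = []
for table_row in soup.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

print(output_rows)
# newline='' keeps the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)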
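
Note that the code above writes each `tr` as its own CSV row, so the header names end up on separate lines rather than next to the facilities. If you want the layout from the question, a sketch along these lines may be closer. It assumes the (facility, header) pairs have already been collected; here they are hard-coded from the markup in the question. The column titles `facility` and `section` and the helper name `rows_to_csv` are my own placeholders, not something taken from the original page. It writes to an in-memory buffer so the result is easy to inspect.

```python
import csv
import io

# (facility, header) pairs, hard-coded from the markup in the question.
rows = [
    ("Complete Care at Linwood, LLC", "ATLANTIC"),
    ("The Health Center At Galloway", "ATLANTIC"),
    ("The Actors Fund Homes", "BERGEN"),
    ("Actors Fund Home, The", "BERGEN"),
]


def rows_to_csv(pairs):
    """Serialise the pairs as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["facility", "section"])  # placeholder column titles
    writer.writerows(pairs)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Names that contain a comma, such as "Complete Care at Linwood, LLC", are quoted by `csv.writer` automatically, so they stay in one column. To write to disk instead, the same writer calls work on a file opened with `open('output.csv', 'w', newline='')`.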
