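This question is a follow-up to this answer. I can convert a single HTML table to JSON, but when there are several tables with different headers, the keys in the result do not match the data.
For example, consider the following HTML content: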
<html>
<body>
<h1>My Heading</h1>
<p>Hello world</p>
<table>
<tr>
<th>Name</th>
<th>Age</th>
<th>License</th>
<th>Amount</th>
</tr>
<tr>
<td>John</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
</tr>
<tr>
<td>Kevin</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
</tr>
<tr>
<td>Smith</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
</tr>
<tr>
<td>Stewart</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
</tr>
</table>
<table>
<tr>
<th>Name2</th>
<th>Age2</th>
<th>License2</th>
<th>Amount2</th>
<th>Random</th>
</tr>
<tr>
<td>Rich</td>
<td>28</td>
<td>Y</td>
<td>12.30</td>
<td>2</td>
</tr>
<tr>
<td>Lou</td>
<td>25</td>
<td>Y</td>
<td>22.30</td>
<td>2</td>
</tr>
<tr>
<td>Harry</td>
<td>38</td>
<td>Y</td>
<td>52.20</td>
<td>2</td>
</tr>
<tr>
<td>Phil</td>
<td>21</td>
<td>N</td>
<td>3.80</td>
<td>2</td>
</tr>
</table>
</body>
</html>
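Note that, besides the heading and paragraph tags, there are two separate tables with different headers. I want to convert these tables to JSON. However, using the code below: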
from bs4 import BeautifulSoup
import json
if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)
    print(json.dumps(table_data, indent=4))
I get the following output:
[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name": "Rich",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30",
        "Name2": "2"
    },
    {
        "Name": "Lou",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30",
        "Name2": "2"
    },
    {
        "Name": "Harry",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20",
        "Name2": "2"
    },
    {
        "Name": "Phil",
        "Age": "21",
        "License": "N",
        "Amount": "3.80",
        "Name2": "2"
    }
]
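This output is incorrect: the two tables have different header columns, yet the second set of JSON objects is keyed by the same headers as the first. Note also that the last column of the second table comes out completely wrong. I want the output to be: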
[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name2": "Rich",
        "Age2": "28",
        "License2": "Y",
        "Amount2": "12.30",
        "Random": "2"
    },
    {
        "Name2": "Lou",
        "Age2": "25",
        "License2": "Y",
        "Amount2": "22.30",
        "Random": "2"
    },
    {
        "Name2": "Harry",
        "Age2": "38",
        "License2": "Y",
        "Amount2": "52.20",
        "Random": "2"
    },
    {
        "Name2": "Phil",
        "Age2": "21",
        "License2": "N",
        "Amount2": "3.80",
        "Random": "2"
    }
]
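The problem is in the line
datum[fields[i]] = td.text
Here i is just the index produced by enumerate, so the key is always looked up in the order in which the headers were first appended in the first inner loop. Because fields is declared once, outside the loop over tables, it keeps growing, and the rows of the second table are still keyed by the headers of the first table. You need a separate fields list for each table, which you get simply by moving the declaration of fields into the outer loop. This produces the desired output.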
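The fix described above might look roughly like the following sketch. The shortened html_data sample and the function name tables_to_records are made up for illustration, and the sketch uses the built-in html.parser backend instead of the lxml backend from the question so that it runs without lxml installed; the logic is otherwise the question's code with fields moved inside the loop over tables.

```python
import json
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative data only).
html_data = """
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>28</td></tr>
</table>
<table>
<tr><th>Name2</th><th>Age2</th><th>Random</th></tr>
<tr><td>Rich</td><td>28</td><td>2</td></tr>
</table>
"""


def tables_to_records(html):
    """Return one dict per data row, keyed by the headers of its own table."""
    soup = BeautifulSoup(html, features="html.parser")
    table_data = []
    for table in soup.find_all("table"):
        fields = []  # fresh header list for every table
        for tr in table.find_all("tr", recursive=False):
            for th in tr.find_all("th", recursive=False):
                fields.append(th.text)
        for tr in table.find_all("tr", recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all("td", recursive=False)):
                datum[fields[i]] = td.text
            if datum:  # header rows have no <td> cells and are skipped
                table_data.append(datum)
    return table_data


records = tables_to_records(html_data)
print(json.dumps(records, indent=4))
```

Since fields starts empty for each table, index i always refers to a header of the table currently being processed.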
I had to clear the list of 'th' fields after each iteration.
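The code for this answer did not survive, so the following is only a guess at what clearing the list could look like. It keeps a single fields list declared outside the loop, as in the question, and empties it with list.clear() once each table has been processed. The html_data sample and the name tables_to_records_clearing are hypothetical, and html.parser is used in place of lxml to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative data only).
html_data = """
<table>
<tr><th>Name</th><th>Amount</th></tr>
<tr><td>Kevin</td><td>22.30</td></tr>
</table>
<table>
<tr><th>Name2</th><th>Amount2</th><th>Random</th></tr>
<tr><td>Lou</td><td>22.30</td><td>2</td></tr>
</table>
"""


def tables_to_records_clearing(html):
    """Same conversion, but one shared header list that is emptied per table."""
    soup = BeautifulSoup(html, features="html.parser")
    fields = []  # shared across tables, as in the question
    table_data = []
    for table in soup.find_all("table"):
        for tr in table.find_all("tr", recursive=False):
            for th in tr.find_all("th", recursive=False):
                fields.append(th.text)
        for tr in table.find_all("tr", recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all("td", recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)
        fields.clear()  # drop this table's headers before the next table
    return table_data


cleared_records = tables_to_records_clearing(html_data)
print(cleared_records)
```

Both variants should give the same result; clearing the list just reuses one list object instead of creating a new one per table.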