Convert multiple tables from HTML to JSON in Python

Posted 2024-10-03 04:28:05


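This question is a follow-up to this answer. I can convert a single HTML table to JSON, but when there are multiple tables with different headers, the keys in the result do not match the tables.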

For example, consider the following HTML:

<html>
    <body>
        <h1>My Heading</h1>
        <p>Hello world</p>
        <table>
            <tr>
                <th>Name</th>
                <th>Age</th>
                <th>License</th>
                <th>Amount</th>
            </tr>
            <tr>
                <td>John</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
            </tr>
            <tr>
                <td>Kevin</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
            </tr>
            <tr>
                <td>Smith</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
            </tr>
            <tr>
                <td>Stewart</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
            </tr>
        </table>
        <table>
            <tr>
                <th>Name2</th>
                <th>Age2</th>
                <th>License2</th>
                <th>Amount2</th>
                <th>Random</th>
            </tr>
            <tr>
                <td>Rich</td>
                <td>28</td>
                <td>Y</td>
                <td>12.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Lou</td>
                <td>25</td>
                <td>Y</td>
                <td>22.30</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Harry</td>
                <td>38</td>
                <td>Y</td>
                <td>52.20</td>
                <td>2</td>
            </tr>
            <tr>
                <td>Phil</td>
                <td>21</td>
                <td>N</td>
                <td>3.80</td>
                <td>2</td>
            </tr>
        </table>
    </body>
</html>

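Note that, besides the heading and paragraph tags, there are two separate tables with different headers. I want to convert these tables to JSON. However, with the following code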

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))

I get the following output:

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name": "Rich",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30",
        "Name2": "2"
    },
    {
        "Name": "Lou",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30",
        "Name2": "2"
    },
    {
        "Name": "Harry",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20",
        "Name2": "2"
    },
    {
        "Name": "Phil",
        "Age": "21",
        "License": "N",
        "Amount": "3.80",
        "Name2": "2"
    }
]

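The output is incorrect: the two tables have different header columns, yet the records from the second table are emitted with the same keys as the first. Also note that the last column of the second table ends up under the wrong key ("Name2" instead of "Random"). I would like the output to be: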

[
    {
        "Name": "John",
        "Age": "28",
        "License": "Y",
        "Amount": "12.30"
    },
    {
        "Name": "Kevin",
        "Age": "25",
        "License": "Y",
        "Amount": "22.30"
    },
    {
        "Name": "Smith",
        "Age": "38",
        "License": "Y",
        "Amount": "52.20"
    },
    {
        "Name": "Stewart",
        "Age": "21",
        "License": "N",
        "Amount": "3.80"
    },
    {
        "Name2": "Rich",
        "Age2": "28",
        "License2": "Y",
        "Amount2": "12.30",
        "Random": "2"
    },
    {
        "Name2": "Lou",
        "Age2": "25",
        "License2": "Y",
        "Amount2": "22.30",
        "Random": "2"
    },
    {
        "Name2": "Harry",
        "Age2": "38",
        "License2": "Y",
        "Amount2": "52.20",
        "Random": "2"
    },
    {
        "Name2": "Phil",
        "Age2": "21",
        "License2": "N",
        "Amount2": "3.80",
        "Random": "2"
    }
]

2 answers

The problem is the line

datum[fields[i]] = td.text

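i is just the index from enumerate, so fields[i] always picks headers in the order they were first appended. Because fields is created once and keeps growing across tables, the headers of the first table stay at the front of the list and are used for every later table as well. You need a separate fields list for each table; just move the declaration of fields inside the outer loop, as follows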

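from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    # xml_data is assumed to hold the HTML document shown in the question
    model = BeautifulSoup(xml_data, features='lxml')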
    table_data = []
    for table in model.find_all("table"):
        fields = []
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))

This produces the desired output.
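
To see why a shared list misaligns the keys, here is a minimal sketch that leaves out the HTML parsing and models each table as plain lists. The names tables, shared_fields and per_table_fields are illustrative assumptions and do not appear in the original code; the values are the first data row of each table in the question.

```python
# Each table is modelled as (headers, first data row).
tables = [
    (["Name", "Age", "License", "Amount"],
     ["John", "28", "Y", "12.30"]),
    (["Name2", "Age2", "License2", "Amount2", "Random"],
     ["Rich", "28", "Y", "12.30", "2"]),
]

# Buggy variant: one list shared by all tables keeps growing,
# so index 0 always points at the first table's first header.
shared_fields = []
buggy = []
for headers, row in tables:
    shared_fields.extend(headers)
    buggy.append({shared_fields[i]: cell for i, cell in enumerate(row)})

# Fixed variant: a fresh list is built for every table.
fixed = []
for headers, row in tables:
    per_table_fields = list(headers)
    fixed.append({per_table_fields[i]: cell for i, cell in enumerate(row)})

print(buggy[1])  # keys come from the first table, plus a stray "Name2"
print(fixed[1])  # keys come from the second table
```

The buggy record for the second table reproduces the stray "Name2": "2" entry seen in the question's output, because the fifth cell is looked up at index 4 of the accumulated list.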

I had to clear the list of 'th' fields at the start of each table iteration:

from bs4 import BeautifulSoup
import json

if __name__ == '__main__':
    model = BeautifulSoup(xml_data, features='lxml')
    fields = []
    table_data = []
    for table in model.find_all("table"):
        fields.clear()
        for tr in table.find_all('tr', recursive=False):
            for th in tr.find_all('th', recursive=False):
                fields.append(th.text)
        for tr in table.find_all('tr', recursive=False):
            datum = {}
            for i, td in enumerate(tr.find_all('td', recursive=False)):
                datum[fields[i]] = td.text
            if datum:
                table_data.append(datum)

    print(json.dumps(table_data, indent=4))
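
Both answers come down to giving each table its own header list. As a possible variation, the sketch below wraps that idea in a function and pairs headers with cells using zip, so a row with fewer cells than headers does not raise an IndexError. The function name tables_to_records and the shortened SAMPLE_HTML are assumptions made for this example, and it uses the standard-library 'html.parser' backend instead of 'lxml' so that only bs4 is required. It also searches each table without recursive=False, which is fine for flat tables like these but would pick up cells from nested tables.

```python
import json
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (an assumption for this sketch).
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>John</td><td>28</td></tr>
</table>
<table>
  <tr><th>Name2</th><th>Age2</th><th>Random</th></tr>
  <tr><td>Rich</td><td>28</td><td>2</td></tr>
</table>
"""

def tables_to_records(html):
    """Return one dict per data row, keyed by the headers of that row's own table."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table"):
        # Headers are collected per table, so nothing carries over.
        fields = [th.get_text(strip=True) for th in table.find_all("th")]
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # skip the header row, which has no <td> cells
                # zip stops at the shorter sequence, so short rows are safe.
                records.append(dict(zip(fields, cells)))
    return records

records = tables_to_records(SAMPLE_HTML)
print(json.dumps(records, indent=4))
```

If the tables need to stay distinguishable in the JSON, one option is to append a list of records per table instead of flattening everything into a single list.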
