How do I clean up an HTML string so I can parse it with lxml in Python?

Published 2024-05-19 07:57:23


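I have a Python string containing HTML that came from JSON, and I want to parse it with the lxml library. The string contains several escape characters and other special characters. How can I clean it up so that I can extract information from it with lxml? I want to use XPath selectors on the string.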

The string:

<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\r\n<html>\r\n\r\n<head>\r\n    <META http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\r\n</head>\r\n\r\n<body>\r\n\r\n<div>\r\n    <table width=\"640\" align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:14px\">\r\n        <tr>\r\n            <td align=\"center\">\r\n\r\n                <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"max-width:600px;text-align:left\">\r\n                    <tr>\r\n                        <td width=\"600\">\r\n                            <table align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\">\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td align=\"center\">\r\n                                        <a href=\"#0.1_\"><img src=\"https://ns.yatracdn.com/common/images/emailers/corp-flight-hotel/yatra-logo.png\" width=\"101\" height=\"45\" alt=\"Yatra.com\" title=\"Yatra.com\" border=\"0\" style=\"font-family:Arial,Helvetica,sans-serif;font-size:25px;color:#ea2330\" vspace=\"0\" hspace=\"0\" align=\"center\"></a>\r\n                                    </td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td height=\"10\"></td>\r\n                                </tr>\r\n                                <tr>\r\n                                    <td>\r\n                                        <table border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"600\" style=\"border:1px solid #d8d8d8\">\r\n                                            <tr>\r\n                                               
 <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td colspan=\"3\"><b>Travel Request Details</b></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                                            <tr>\r\n                                                <td width=\"10\"></td>\r\n                                                <td>\r\n                                                    <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\" style=\"border:1px solid #d8d8d8\">\r\n                                                        <tbody>\r\n                                                        <tr>\r\n                                                            <td width=\"10\"></td>\r\n                                                            <td>\r\n                                                                <table width=\"100%\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" bgcolor=\"#ffffff\">\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Email Verification Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr id='aaaaa'>\r\n                         
                                           <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Request Submission Date / Time </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">12 May 2020 17:14</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Product </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Flight</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Journey Type </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">One way</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n  
                                                                          Adult </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">1</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Child </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Infant </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">0</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Flight Class </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Travel Class</td>\r\n                     
                                               </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Preferred Airline </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            </td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Non Stop Flight </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Airline</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Email </td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">ankityadav56@demo.com</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n        
                                                                <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Traveller Mobile</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">9971255462</td>\r\n                                                                    </tr>\r\n                                                                    <tr>\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">\r\n                                                                            Travel Policy Email</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">Corporate.traveler@yatra.com</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Origin</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">New Delhi(DEL)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Destination</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" 
style=\"border-bottom:1px solid #d8d8d8\">Mumbai(BOM)</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Depart Date</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">26 Jun 2020</td>\r\n                                                                    </tr>\r\n\r\n                                                                    <tr >\r\n                                                                        <td height=\"35\" valign=\"middle\" align=\"left\" style=\"border-bottom:1px solid #d8d8d8\">Preferred Time From</td>\r\n                                                                        <td colspan=\"2\" align=\"right\" style=\"border-bottom:1px solid #d8d8d8\">00:23</td>\r\n                                                                    </tr>\r\n\r\n                                                                </table>\r\n                                                            </td>\r\n                                                            <td width=\"10\"></td>\r\n                                                        </tr>\r\n\r\n                                                        </tbody>\r\n                                                    </table>\r\n\r\n                                                </td>\r\n                                                <td width=\"10\"></td>\r\n                                            </tr>\r\n\r\n                                            <tr>\r\n                                                <td height=\"10\"></td>\r\n                                            </tr>\r\n                             
           </table>\r\n\r\n                                    </td>\r\n                                </tr>\r\n                            </table>\r\n                        </td>\r\n                    </tr>\r\n                </table>\r\n            </td>\r\n        </tr>\r\n    </table>\r\n\r\n</div>\r\n\r\n</body>\r\n\r\n</html>

For a clean string, the parser works like this:

>>> from io import StringIO
>>> from lxml import etree
>>> broken_html = "<html><head><title>test<body><h1>page title</h3>"

>>> parser = etree.HTMLParser()
>>> tree   = etree.parse(StringIO(broken_html), parser)

>>> result = etree.tostring(tree.getroot(), encoding="unicode",
...                         pretty_print=True, method="html")
>>> print(result)
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>page title</h1>
  </body>
</html>

2 answers

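Maybe you want to use BeautifulSoup? It is a library that parses markup into a tree you can iterate over. You can also search for specific tags, classes, and so on. One of its parser options is lxml.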

from bs4 import BeautifulSoup

soup = BeautifulSoup(broken_html, 'lxml')  # use lxml as the underlying parser
soup.title  # returns the <title> tag of the document
soup.find_all('div')  # returns a list of all div tags
my_tag = soup.find(id="yourID")  # first tag whose id is "yourID", or None
my_tag.find_all('div')  # returns every div tag nested inside that tag
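Since the question asks for XPath selectors specifically, here is a minimal sketch of the same kind of lookup done with lxml directly. The `sample` markup is a small hypothetical stand-in for the label/value rows in the question's e-mail, and `value_for` is a helper name made up for illustration; it is not part of lxml.

```python
from lxml import html

# Hypothetical, shortened stand-in for the label/value rows in the question.
sample = """<table>
  <tr><td>Product </td><td>Flight</td></tr>
  <tr><td>Journey Type </td><td>One way</td></tr>
</table>"""

tree = html.fromstring(sample)


def value_for(label):
    """Return the text of the cell right after the cell whose text is `label`."""
    # normalize-space() ignores the padding and line breaks around the label;
    # $label is an XPath variable filled in from the keyword argument.
    cells = tree.xpath(
        "//td[normalize-space(.) = $label]/following-sibling::td[1]",
        label=label,
    )
    return cells[0].text_content().strip() if cells else None


print(value_for("Product"))
print(value_for("Journey Type"))
```

Passing the label as an XPath variable avoids building the expression by string concatenation, which would break on labels that contain quotes.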

It looks like you first need to unescape the string, so have a look at ChristopheD's answer:

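# Python 2 only; on Python 3 (ASCII-only input) use codecs.decode(html_escaped_string, 'unicode_escape')
html_unescaped_string = html_escaped_string.decode('string_escape')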

After that you can use BeautifulSoup and keep your fingers crossed that it copes with the otherwise malformed string.
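Because the string comes from JSON, the `\"` and `\r\n` sequences may simply be JSON escaping that disappears once the payload is decoded. The sketch below assumes that is the case; the `raw` payload and its `"body"` key are hypothetical, so adjust them to the real structure of your data.

```python
import json

# Hypothetical raw JSON text; the "body" key is an assumption for illustration.
# The raw string keeps the backslashes exactly as they appear in the JSON file.
raw = r'{"body": "<td align=\"right\">Flight</td>\r\n"}'

payload = json.loads(raw)    # decoding JSON resolves \" and \r\n
html_text = payload["body"]  # plain HTML, ready for lxml or BeautifulSoup

print(html_text)
```

If the backslashes are still present after `json.loads`, the value was probably escaped twice upstream, and a manual unescape step like the one above is still needed.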
