How to prevent lxml from adding a default doctype

Posted 2024-09-28 05:24:49


When an HTML document has no doctype, lxml appears to add a default one.

See the following demo code:

import lxml.etree
import lxml.html


def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

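    # tostring() returns bytes when an encoding is given; decode the result
    # so the str-based assertions below also work on Python 3.
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    ).decode('utf8')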


with_doctype = """
<!DOCTYPE html>
<html>
<head>
  <title>With Doctype</title>
</head>
</html>
"""

# This passes!
assert "DOCTYPE" in beautify(with_doctype)

no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

# This fails!
assert "DOCTYPE" not in beautify(no_doctype)

# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before

How can I tell lxml not to do this?

This issue was originally raised here: https://github.com/mitmproxy/mitmproxy/issues/845

Quoting a comment on reddit that may help:

lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.

I don't know if you can tell lxml to pass that option though.. libxml has python bindings that you could perhaps use directly but they seem really hairy.

EDIT: did some more digging and that option does appear in the lxml source here. That option does exactly what you want but I'm not sure how to activate it yet, if it's even possible.


1 answer
User
#1 · Posted 2024-09-28 05:24:49

There is currently no way to do this in lxml, but I created a pull request on lxml that adds a default_doctype boolean to HTMLParser.
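
Until such an option is available, one possible workaround is to check the source text for a doctype and only pass docinfo.doctype to tostring() when one was actually present. The sketch below is not part of the pull request. The helper name source_has_doctype and its regular expression are illustrative assumptions, and the check only recognizes a doctype at the very start of the document (after optional whitespace), so a leading comment or XML declaration would defeat it.

```python
import re

import lxml.etree
import lxml.html

# Hypothetical helper pattern: a doctype declaration at the start of the
# document, allowing only leading whitespace before it.
_DOCTYPE_RE = re.compile(r"^\s*<!DOCTYPE\b", re.IGNORECASE)


def source_has_doctype(html):
    """Return True if the raw HTML text starts with a doctype declaration."""
    return _DOCTYPE_RE.match(html) is not None


def beautify(html):
    parser = lxml.etree.HTMLParser(remove_blank_text=True)
    d = lxml.html.fromstring(html, parser=parser)

    # Only forward the doctype lxml reports if the source really had one;
    # otherwise pass None so tostring() writes no doctype line at all.
    doctype = None
    if source_has_doctype(html):
        doctype = d.getroottree().docinfo.doctype

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=doctype,
        encoding="utf8",
    ).decode("utf8")
```

Because tostring() is called on the root element rather than on the whole tree, the default doctype that the parser attached to the document is not serialized unless it is passed in explicitly.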

Once the code is merged, the parser needs to be created as follows:

parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)

Everything else stays the same.
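
For reference, here is roughly how the question's beautify function might look with that parser. This is a sketch that assumes the installed lxml accepts the default_doctype keyword on HTMLParser; on a version without it, the constructor would be expected to raise a TypeError. The strip_cdata argument is left out here for brevity, and the `or None` is an addition for illustration: docinfo.doctype is an empty string when the document has no doctype, and passing None avoids writing an empty doctype line.

```python
import lxml.etree
import lxml.html


def beautify(html):
    # Assumption: this lxml version supports default_doctype on HTMLParser.
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        # Empty string means "no doctype in the source": pass None instead.
        doctype=docinfo.doctype or None,
        encoding="utf8",
    ).decode("utf8")


no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

# With default_doctype=False, no doctype should be invented for this input.
assert "DOCTYPE" not in beautify(no_doctype)
```

A document that does declare a doctype should still keep it, since docinfo.doctype is only empty when the parser found none.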
