lxml seems to add a default doctype when an HTML document is missing one.
See the following demo code:
import lxml.etree
import lxml.html
def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    )
with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""
# This passes!
# tostring() returns bytes when an encoding is given, so compare against bytes
assert b"DOCTYPE" in beautify(with_doctype)
no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""
# This fails!
assert b"DOCTYPE" not in beautify(no_doctype)
# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before
How can I tell lxml not to do this?
This issue was originally raised here: https://github.com/mitmproxy/mitmproxy/issues/845
Quoting a comment on reddit that may help:
lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here. I don't know if you can tell lxml to pass that option though. libxml has python bindings that you could perhaps use directly but they seem really hairy.
EDIT: did some more digging and that option does appear in the lxml source here. That option does exactly what you want but I'm not sure how to activate it yet, if it's even possible.
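There is currently no way to do this in lxml, but I have created a Pull Request on lxml that adds a default_doctype boolean to HTMLParser. Once the code is merged, the parser needs to be created with default_doctype=False.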
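A minimal sketch of that parser setup, assuming an lxml release whose HTMLParser accepts the default_doctype keyword (documented for lxml 3.6 and later). The beautify function and the remove_blank_text option are taken from the question. strip_cdata is left out only to keep the sketch short, and the inline test strings are illustrative.

```python
import lxml.etree
import lxml.html


def beautify(html):
    # default_doctype=False asks the underlying libxml2 parser not to
    # attach a default DTD when the source has no doctype.
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    # docinfo.doctype is an empty string when the document has no doctype
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8',
    )


if __name__ == "__main__":
    no_doctype = "<html><head><title>No Doctype</title></head></html>"
    # tostring() returns bytes here, so the check uses a bytes literal
    assert b"DOCTYPE" not in beautify(no_doctype)
```

With this parser, a document that does declare a doctype should still keep it, because docinfo.doctype is passed through to tostring() as before.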
Everything else stays the same.