How can I efficiently extract <![CDATA[]]> content from XML using Python?

<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23"> <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document> <document><![CDATA[Ugh ]]></document> <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document> <document><![CDATA[@username Shout out to me???? ]]></document> </author>

1 Answer

User

#1 · Posted 2024-10-01 13:38:02

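There are several things wrong here. (Questions about choosing a library are against the rules here, so I'm ignoring that part of the question.)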

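1. You need to pass in a file handle, not a file name.
   That is: y = BeautifulSoup(open(x))
2. You need to tell BeautifulSoup that it is dealing with XML.
   That is: y = BeautifulSoup(open(x), 'xml')
3. CDATA sections don't create elements. You can't search for them in the DOM, because they don't exist in the DOM; they're just syntactic sugar. Just look at the text directly under document, and don't try to search for something named CDATA.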
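To state it again, slightly differently: <doc><![CDATA[foo]]></doc> is exactly the same as <doc>foo</doc>. What is different about a CDATA section is that everything inside it is automatically escaped, meaning that <![CDATA[<hello>]]> is interpreted as &lt;hello&gt;. However, you can't tell from the parsed object tree whether your document contained a CDATA section with literal < and >, or a raw text section with &lt; and &gt;. This is by design, and it is true of any compliant XML DOM implementation.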
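To make that equivalence concrete, here is a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree rather than BeautifulSoup, purely so it runs without extra packages; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same content: a CDATA section vs. escaped entities.
with_cdata = "<doc><![CDATA[<hello>]]></doc>"
with_entities = "<doc>&lt;hello&gt;</doc>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_entities = ET.fromstring(with_entities).text

# Both parse to the same plain string; the tree keeps no trace of the CDATA syntax.
print(text_from_cdata == text_from_entities)
```

Both variables end up holding the literal string <hello>, which is why searching the tree for something called CDATA finds nothing.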

Now, how about some code that actually works:

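import bs4

# The XML declaration should be the very first thing in the document,
# so the string starts right after the opening quotes.
doc = """<?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23">
    <document><![CDATA["@username: That came at the wrong time ????" HELP I'M DYING       ]]></document>
    <document><![CDATA[Ugh      ]]></document>
    <document><![CDATA[YES !!!! WE GO FOR IT.       ]]></document>
    <document><![CDATA[@username Shout out to me????        ]]></document>
</author>
"""

# The 'xml' parser (which requires lxml to be installed) treats the input as XML, not HTML.
doc_el = bs4.BeautifulSoup(doc, 'xml')

# The CDATA content is simply the text of each <document> element.
print([el.text for el in doc_el.find_all('document')])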

If you want to read from a file, replace doc with open(filename, 'r').
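For the file case, a rough sketch might look like the following. It swaps in the standard-library xml.etree.ElementTree instead of BeautifulSoup so that it runs without third-party packages; the helper name read_document_texts, the shortened sample data, and the temporary file are all assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_document_texts(path):
    """Return the stripped text of every <document> element in the file at `path`."""
    tree = ET.parse(path)  # ET.parse accepts a filename or an open file object
    return [el.text.strip() for el in tree.getroot().iter("document") if el.text]


# Hypothetical sample data, shortened from the question's XML.
sample = (
    '<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
    '<author id="user23">'
    "<document><![CDATA[Ugh      ]]></document>"
    "<document><![CDATA[@username Shout out to me????        ]]></document>"
    "</author>"
)

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False, encoding="utf-8") as fh:
    fh.write(sample)
    path = fh.name

try:
    texts = read_document_texts(path)
    print(texts)
finally:
    os.remove(path)
```

As with the BeautifulSoup version, the CDATA content is read as ordinary element text; strip() only trims the trailing padding present in the sample.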
