Python xml.etree.ElementTree: removing an empty tag in the middle of text

Published 2024-10-02 20:30:46


I have an XML document from which I want to extract text based on the tag.
The part I want to extract text from looks like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do this:

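import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text  # only the text before the first child element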

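I only get the part that comes before the empty tag <TIP CONTENT="­"/>.
I tried to remove this tag before fetching the rest of the text.
This is what I did: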

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this doesn't work.
Neither <BlockText> nor <TIP> is a direct child of root.

Thanks, everyone.


3 answers

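The text after <TIP CONTENT="­"/> belongs to that element's own tail, not to the text of the BlockText tag.

elem.text is the text that follows the element's opening tag, up to its first child. elem.tail is the text that follows the element's closing tag, up to the next sibling or the parent's closing tag. The tail is usually just whitespace, but in this case it holds actual text.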
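To make the text/tail split concrete, here is a minimal sketch. It uses a shortened, well-formed copy of the question's snippet (attributes dropped, CONTENT value replaced with a placeholder), so the strings are illustrative rather than the original data.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's <BlockText> element (an assumption,
# not the original file).
xml_snippet = '<BlockText>survived not only<TIP CONTENT="x"/> five centuries</BlockText>'

block = ET.fromstring(xml_snippet)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText> and <TIP/>
print(repr(tip.text))    # None, because <TIP/> has no content of its own
print(repr(tip.tail))    # text between <TIP/> and </BlockText>
```

Reading block.text alone therefore stops at the first child element, which matches the behaviour described in the question.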

OK, here is what I did:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # Print the text that follows each empty TIP element
    for element in root.iter(emptyTag):
        print(element.tail)

But I still can't get the whole text block in its original order. I can get all the BlockText tags and all the TIP tags, but not the two together.

Update:
I used:

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
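If the goal is to actually delete the empty tags rather than just read past them, one possible approach is sketched below. The helper name remove_keep_tail is hypothetical, and the sketch assumes plain tag names without namespaces. Because root.remove() only works on direct children, it visits every element as a potential parent and moves each removed child's tail onto the preceding text so nothing is lost.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(root, tag):
    """Remove every `tag` element below root, keeping the text in its tail.

    Hypothetical helper for illustration; anything inside a removed
    element is dropped along with it.
    """
    # Elements have no parent pointer, so treat every element as a parent.
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if prev is None:
                    # No earlier sibling: the tail joins the parent's text.
                    parent.text = (parent.text or "") + tail
                else:
                    prev.tail = (prev.tail or "") + tail
                parent.remove(child)
            else:
                prev = child

# Usage with a shortened stand-in for the question's document.
doc = ET.fromstring(
    '<doc><BlockText>not only<TIP CONTENT="x"/> five<TIP/> centuries</BlockText></doc>'
)
remove_keep_tail(doc, "TIP")
block = doc.find("BlockText")
print(repr(block.text))
```

For read-only extraction, ''.join(elem.itertext()) as in the update above is simpler; the removal variant is only worth it when the modified tree is needed afterwards.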

Another solution, for reference:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print (doc.select('BlockText'))
print (doc.select('BlockText>text()'))
print (doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
