Python xml.etree.ElementTree: removing an empty tag in the middle of text

<BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en"> Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release. </BlockText>

import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text

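3 Answers

User

Answer 1 · edited 2024-10-02 20:30:46

The text after <TIP CONTENT=""/> belongs to the tail of that TIP element, not to the text of the BlockText element.

elem.text is the text that follows the element's opening tag. elem.tail is the text that follows its closing tag. The tail is usually just whitespace, but in this case it holds actual text.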
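
To make the text/tail distinction concrete, here is a minimal sketch. It assumes a shortened, well-formed variant of the sample from the question (the snippet and the variable names are illustrative, not the asker's actual file):

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed variant of the sample (an assumption for illustration).
xml_snippet = (
    '<BlockText lang="en">It has survived not only'
    '<TIP CONTENT=""/> five centuries.</BlockText>'
)

block = ET.fromstring(xml_snippet)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText ...> and <TIP .../>
print(repr(tip.text))    # the empty element has no text of its own
print(repr(tip.tail))    # text between <TIP .../> and </BlockText>
```

Reading only block.text therefore stops at the TIP element; the rest of the sentence has to be collected from tip.tail.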

User

Answer 2 · edited 2024-10-02 20:30:46

OK, this is what I have working so far:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # Print the text that follows each empty TIP element
    for element in root.iter(emptyTag):
        print(element.tail)

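But I still could not get the whole block of text in its original order. I could get all the BlockText tags and all the TIP tags, but not both together.

Update:
I ended up using: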

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
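
As a sketch of why itertext() solves this, and of how the empty tag could be dropped from the tree itself without losing the text behind it: itertext() yields each element's text and each child's tail in document order, so joining it gives the full sentence. The helper name remove_keep_tail and the shortened snippet below are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Hypothetical helper: remove child from parent but keep child.tail."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        # No earlier sibling: the tail continues the parent's own text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise the tail continues the previous sibling's tail.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)

# Shortened variant of the sample (an assumption for illustration).
block = ET.fromstring(
    '<BlockText>not only<TIP CONTENT=""/> five centuries</BlockText>'
)

# Option 1: read the text in document order, leaving the tree untouched.
full_text = "".join(block.itertext())

# Option 2: strip the empty TIP elements from the tree itself.
for tip in list(block.findall("TIP")):
    remove_keep_tail(block, tip)

print(full_text)
print(ET.tostring(block, encoding="unicode"))
```

Calling parent.remove(child) alone would discard the tail along with the element, which is why the helper moves the tail first.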

User

Answer 3 · edited 2024-10-02 20:30:46

Another solution, for reference only:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print (doc.select('BlockText'))
print (doc.select('BlockText>text()'))
print (doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
