lxml: get the full content of a tag, including child nodes and text

Posted 2024-09-30 01:30:40


I want to get all of the text content, including the inline tags, from the XML below:

<title-group><article-title xml:lang="en">Correction to: Effective adsorptive performance of Fe<sub>3</sub>O<sub>4</sub>@SiO<sub>2</sub>core shell spheres for methylene blue: kinetics, isotherm and mechanism</article-title></title-group>

The output for the above should be:

Correction to: Effective adsorptive performance of Fe<sub>3</sub>O<sub>4</sub>@SiO<sub>2</sub>core shell spheres for methylene blue: kinetics, isotherm and mechanism

I tried the following, but it gives me an incomplete value:

from lxml import etree

s = '<title-group><article-title xml:lang="en">Correction to: Effective adsorptive performance of Fe<sub>3</sub>O<sub>4</sub>@SiO<sub>2</sub>core shell spheres for methylene blue: kinetics, isotherm and mechanism</article-title></title-group>'
d = etree.fromstring(s)
title_xpath = '/title-group/article-title'
title = ""
if not d.xpath(title_xpath)[0].getchildren():
    title = d.xpath(title_xpath)[0].text
else:
    for title_elem in d.xpath(title_xpath):
        title_parts = title_elem.getchildren()
        title = ''.join(etree.tostring(part, encoding="unicode") for part in title_parts)
print(title)

The code above gives me:

<sub>3</sub>O<sub>4</sub>@SiO<sub>2</sub>core shell spheres for methylene blue: kinetics, isotherm and mechanism

2 answers

Perhaps get the element and extract its text with text_content().

Starting from the XML tree `d` (this is just my idea and it is not very pretty, but let me know if it does what you need):

text = ""
# d is already the <title-group> root, so look for its <article-title> descendants
for element in d.iter("article-title"):
    # text_content() exists only on lxml.html elements;
    # itertext() is the equivalent for elements parsed with lxml.etree
    text += "".join(element.itertext())
print(text)  # plain text only: the <sub> tags themselves are dropped

You can try BeautifulSoup:

>>> s= '<title-group><article-title xml:lang="en">Correction to: Effective adsorptive performance of Fe<sub>3</sub>O<sub>4</sub>@SiO<sub>2</sub>core shell spheres for methylene blue: kinetics, isotherm and mechanism</article-title></title-group>'

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'lxml')
>>> soup.getText()
'Correction to: Effective adsorptive performance of Fe3O4@SiO2core shell spheres for methylene blue: kinetics, isotherm and mechanism'
