Creating raw text from XML markup

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

# xml_doc is actually a file; the XML is inlined above for reproducibility.
# (ET.parse expects a filename or file object; for the inline string, use ET.fromstring.)
tree = ET.parse(xml_doc)

1 Answer

User

#1 · Posted 2024-06-02 15:04:20

Normally I would do this with pure XPath:

normalize-space(//TXT)

However, XPath support in ElementTree is limited, so this only works in lxml.

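To do it in ElementTree, I would do what the answer you linked in your question does: use method="text" to force plain-text output. You will also need to normalize the whitespace.

Example...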

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

tree = ET.fromstring(xml_doc)

txt = tree.find(".//TXT")
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
normalized_text = " ".join(raw_text.split())
print(normalized_text)

Printed output...

George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
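
As a possible alternative to serializing with method="text", the same text could be collected with Element.itertext() from the standard library. The following is only a sketch: the helper name normalized_text and the trimmed-down sample (attributes such as sid and id dropped) are assumptions made for illustration and are not part of the original question or answer.

```python
import xml.etree.ElementTree as ET


def normalized_text(elem):
    """Hypothetical helper: join all text inside elem and collapse whitespace."""
    # itertext() yields the text and tail strings of elem's descendants
    # in document order (elem's own tail is not included).
    joined = "".join(elem.itertext())
    # split() with no arguments drops leading/trailing whitespace and
    # splits on any whitespace run, so the join yields single spaces.
    return " ".join(joined.split())


# Trimmed-down version of the sample document, for illustration only.
sample = """<TXT>
  <S><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
  <S><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
</TXT>"""

root = ET.fromstring(sample)
result = normalized_text(root)
print(result)
```

This avoids the encode/decode round trip of ET.tostring(...).decode(); whether that matters depends on the size of the documents being processed.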
