Creating a dictionary from tagged text in an XML file (Python)

<DOC>
<DOCNO>7466</DOCNO>
<PROFILE>_AN-CESAXAAAFT</PROFILE>
<DATE>920518 </DATE>
<HEADLINE> FT 18 MAY 92 / World News In Brief: Mansell drives into the history books </HEADLINE>
<TEXT> Britain's Nigel Mansell (left) led from start to finish in the San Marino Grand Prix at Imola yesterday, becoming the first driver to win the first five races of a formula one season. Mansell has a maximum 50 points in the drivers' championship, 26 clear of second-placed Italian Riccardo Patrese. </TEXT>
<PUB>The Financial Times </PUB>
<PAGE> International Page 1 </PAGE>
</DOC>

import xml.etree.ElementTree as ET
import re
from stemming.porter2 import stem as PT

tree = ET.parse('articles.xml')
root = tree.getroot()

myDict = {}
for x in root:
    myDict[x.find("DOCNO").text] = x.find("TEXT").text

for x in root.iter('TEXT'):
    x = x.lower()
    x = re.split('[^a-zA-Z]', x)
    x = PT(x)
    print x

1 answer

Anonymous user

#1 · Posted on 2024-09-29 01:38:59

The code below builds the dictionary, keyed by DOCNO, and leaves the text processing as a stub:

import xml.etree.ElementTree as ET


def _modify_text(txt):
    # TODO: implement the text processing
    return txt


xml = '''<DOCS><DOC>
   <DOCNO>7466</DOCNO>
   <PROFILE>_AN-CESAXAAAFT</PROFILE>
   <DATE>920518</DATE>
   <HEADLINE>FT  18 MAY 92 / World News In Brief: Mansell drives into the history books</HEADLINE>
   <TEXT>Britain's Nigel Mansell (left) led from start to finish in the San Marino
Grand Prix at Imola yesterday, becoming the first driver to win the first
five races of a formula one season. Mansell has a maximum 50 points in the
drivers' championship, 26 clear of second-placed Italian Riccardo Patrese.</TEXT>
   <PUB>The Financial Times</PUB>
   <PAGE>International Page 1</PAGE>
</DOC></DOCS>'''

root = ET.fromstring(xml)
# Replace newlines with spaces so that words on adjacent lines are not glued together.
data = {
    d.find('./DOCNO').text: _modify_text(d.find('./TEXT').text.replace('\n', ' '))
    for d in root.findall('.//DOC')
}
print(data)

Output

{'7466': "Britain's Nigel Mansell (left) led from start to finish in the San Marino Grand Prix at Imola yesterday, becoming the first driver to win the first five races of a formula one season. Mansell has a maximum 50 points in the drivers' championship, 26 clear of second-placed Italian Riccardo Patrese."}
