Creating a dictionary from tagged text in an XML file (Python)

Posted 2024-09-29 01:38:59


I have an XML file (see the sample below):

<DOC>
<DOCNO>7466</DOCNO>
<PROFILE>_AN-CESAXAAAFT</PROFILE>
<DATE>920518
</DATE>
<HEADLINE>
FT  18 MAY 92 / World News In Brief: Mansell drives into the history books
</HEADLINE>
<TEXT>
Britain's Nigel Mansell (left) led from start to finish in the San Marino
Grand Prix at Imola yesterday, becoming the first driver to win the first
five races of a formula one season. Mansell has a maximum 50 points in the
drivers' championship, 26 clear of second-placed Italian Riccardo Patrese.
</TEXT>
<PUB>The Financial Times
</PUB>
<PAGE>
International Page 1
</PAGE>
</DOC>

(The actual file contains many of these.)

So far I have come up with the following code:

import xml.etree.ElementTree as ET
import re
from stemming.porter2 import stem as PT

tree = ET.parse('articles.xml')
root = tree.getroot()

myDict = {}
for x in root:
    myDict[x.find("DOCNO").text] = x.find("TEXT").text


for x in root.iter('TEXT'):
    x = x.lower()
    x = re.split('[^a-zA-Z]', x)
    x = PT(x)

print x

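Basically, what I want is a dictionary where the key is the DOCNO and the value is the text inside the TEXT tag, but after it has been processed (in this case that means: converting everything to lowercase, splitting on non-alphanumeric characters, and stemming the words).

I'm new to Python, so if anyone has better suggestions, please let me know! Any advice would be greatly appreciated :)

Also, whenever I try to print the text from the TEXT tag, I get an odd format that I can't seem to get rid of (\n everywhere). Any idea why, or how to fix it? E.g.: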

{'8167': '\nTwo officers were injured when police clashed with youths on a Coventry\nestate for the second night running. Four petrol bombs were thrown at police\nvans.\n', 

1 answer

Answer 1, posted 2024-09-29 01:38:59

See below:

import xml.etree.ElementTree as ET


def _modify_text(txt):
    return txt
    # TODO implement


xml = '''<DOCS><DOC>
   <DOCNO>7466</DOCNO>
   <PROFILE>_AN-CESAXAAAFT</PROFILE>
   <DATE>920518</DATE>
   <HEADLINE>FT  18 MAY 92 / World News In Brief: Mansell drives into the history books</HEADLINE>
   <TEXT>Britain's Nigel Mansell (left) led from start to finish in the San Marino
Grand Prix at Imola yesterday, becoming the first driver to win the first
five races of a formula one season. Mansell has a maximum 50 points in the
drivers' championship, 26 clear of second-placed Italian Riccardo Patrese.</TEXT>
   <PUB>The Financial Times</PUB>
   <PAGE>International Page 1</PAGE>
</DOC></DOCS>'''

root = ET.fromstring(xml)
# Collapse newlines (and any other runs of whitespace) into single spaces,
# so that words on adjacent lines are not glued together.
data = {d.find('./DOCNO').text: _modify_text(' '.join(d.find('./TEXT').text.split())) for d in root.findall('.//DOC')}
print(data)

Output

{'7466': "Britain's Nigel Mansell (left) led from start to finish in the San Marino Grand Prix at Imola yesterday, becoming the first driver to win the first five races of a formula one season. Mansell has a maximum 50 points in the drivers' championship, 26 clear of second-placed Italian Riccardo Patrese."}
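The processing step that the question describes (lowercase, split, stem) is left as a TODO above. One possible way to fill it in is sketched below. The names `modify_text` and `build_dict`, the `stem` parameter, and the shortened sample document are assumptions made for illustration. The split follows the question's own regex idea, so it keeps letters only and drops digits. The `\n` characters appear because the line breaks inside `<TEXT>` are part of the element's text; splitting on non-letters discards them along with punctuation. To keep the example free of third-party packages, `stem` defaults to an identity function; the `stem` function from `stemming.porter2`, which the question already imports, could presumably be passed in its place.

```python
import re
import xml.etree.ElementTree as ET


def modify_text(txt, stem=lambda word: word):
    """Lowercase txt, split it on non-letter characters, and stem each word.

    `stem` is a placeholder that defaults to the identity function
    (an assumption for this sketch); pass a real stemmer to get stems.
    """
    words = re.split(r'[^a-z]+', txt.lower())
    # re.split leaves empty strings at the edges (e.g. from the leading
    # newline), so filter them out before stemming.
    return [stem(w) for w in words if w]


def build_dict(xml_text, stem=lambda word: word):
    """Map each DOCNO to the processed word list of its TEXT element."""
    root = ET.fromstring(xml_text)
    return {
        doc.findtext('DOCNO').strip(): modify_text(doc.findtext('TEXT'), stem)
        for doc in root.iter('DOC')
    }


# Shortened, hypothetical sample in the same shape as the question's file.
sample = '''<DOCS><DOC>
<DOCNO>8167</DOCNO>
<TEXT>
Two officers were injured when police clashed with youths on a Coventry
estate for the second night running.
</TEXT>
</DOC></DOCS>'''

data = build_dict(sample)
print(data)
```

Each value here is a list of words; if a single string is preferred, `' '.join(...)` over the list would give one. For a file on disk, `ET.parse('articles.xml').getroot()` can replace `ET.fromstring`, provided the file has a single root element wrapping all the `<DOC>` entries.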
