I have an XML file (see the sample below):
<DOC>
<DOCNO>7466</DOCNO>
<PROFILE>_AN-CESAXAAAFT</PROFILE>
<DATE>920518
</DATE>
<HEADLINE>
FT 18 MAY 92 / World News In Brief: Mansell drives into the history books
</HEADLINE>
<TEXT>
Britain's Nigel Mansell (left) led from start to finish in the San Marino
Grand Prix at Imola yesterday, becoming the first driver to win the first
five races of a formula one season. Mansell has a maximum 50 points in the
drivers' championship, 26 clear of second-placed Italian Riccardo Patrese.
</TEXT>
<PUB>The Financial Times
</PUB>
<PAGE>
International Page 1
</PAGE>
</DOC>
(The actual file contains many of these.)
So far I have come up with the following code:
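import xml.etree.ElementTree as ET
import re
from stemming.porter2 import stem as PT

tree = ET.parse('articles.xml')
root = tree.getroot()

# Map each DOCNO to the raw (unprocessed) TEXT content
myDict = {}
for x in root:
    myDict[x.find("DOCNO").text] = x.find("TEXT").text

for x in root.iter('TEXT'):
    # x is an Element, so the string to process is x.text
    text = x.text.lower()
    # Split on non-letters and drop the empty strings re.split leaves behind
    words = [w for w in re.split('[^a-zA-Z]', text) if w]
    # stem() takes a single word, so apply it to each token
    words = [PT(w) for w in words]
    print(words)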
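Basically, what I want is a dictionary whose keys are the DOCNO values and whose values are the contents of the TEXT elements, but only after that text has been processed (in this case: converted to lowercase, split on non-alphanumeric characters, and stemmed word by word).
I'm new to Python, so if anyone has a better approach, please let me know! Any suggestions would be much appreciated :)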
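The processing described above (lowercase, split on non-alphanumeric characters, stem) could be sketched roughly as follows. This is only a minimal sketch under a few assumptions: the helper names `process_text` and `build_index` are hypothetical, the `<DOCS>` wrapper in the sample is an assumed root element, and the stemmer is passed in as a parameter (the default does nothing) so the example runs without any third-party package.

```python
import re
import xml.etree.ElementTree as ET


def process_text(text, stem=lambda word: word):
    """Lowercase, split on non-alphanumeric characters, and stem each token."""
    tokens = re.split(r'[^a-z0-9]+', text.lower())
    # re.split leaves empty strings at the edges; drop them before stemming
    return [stem(token) for token in tokens if token]


def build_index(root, stem=lambda word: word):
    """Map each DOCNO to the processed token list of its TEXT element."""
    index = {}
    for doc in root.iter('DOC'):
        docno = doc.findtext('DOCNO').strip()
        index[docno] = process_text(doc.findtext('TEXT', default=''), stem)
    return index


# Tiny inline sample; the <DOCS> wrapper is an assumed root element
sample = """<DOCS>
<DOC>
<DOCNO>7466</DOCNO>
<TEXT>
Britain's Nigel Mansell led from start to finish.
</TEXT>
</DOC>
</DOCS>"""

index = build_index(ET.fromstring(sample))
print(index)
```

To use the Porter2 stemmer from the question, the same call would presumably become `build_index(root, stem=PT)`, with `root` taken from `ET.parse('articles.xml').getroot()`.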
Also, whenever I try to print the text inside the TEXT tag, I get odd formatting that I can't seem to get rid of (\n everywhere). Any idea why, or how to fix it? e.g.:
{'8167': '\nTwo officers were injured when police clashed with youths on a Coventry\nestate for the second night running. Four petrol bombs were thrown at police\nvans.\n',
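The `\n` sequences are most likely just the line breaks that exist inside each TEXT element in the file; printing a dictionary shows the `repr` of every string, so those line breaks appear as escaped `\n`. One way to remove them, sketched here with a hypothetical helper named `normalize_whitespace`, is to collapse every run of whitespace into a single space before storing the text:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties
    return ' '.join(text.split())


raw = ('\nTwo officers were injured when police clashed with youths on a Coventry\n'
       'estate for the second night running. Four petrol bombs were thrown at police\n'
       'vans.\n')
clean = normalize_whitespace(raw)
print(clean)
```

If the text is split into words anyway, as in the processing step above, the newlines disappear as a side effect, since they are not letters or digits.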