How do I read an XML file in Python and convert it into text data for NLP work?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE corpus SYSTEM "puns.dtd">
-<corpus lang="en" id="subtask2-heterographic">
-<text id="het_1">
<word id="het_1_1">'</word>
<word id="het_1_2">'</word>
<word id="het_1_3">I</word>
<word id="het_1_4">'</word>
<word id="het_1_5">m</word>
<word id="het_1_6">halfway</word>
<word id="het_1_7">up</word>
<word id="het_1_8">a</word>
<word id="het_1_9">mountain</word>
<word id="het_1_10">,</word>
<word id="het_1_11">'</word>
<word id="het_1_12">'</word>
<word id="het_1_13">Tom</word>
<word id="het_1_14">alleged</word>
<word id="het_1_15">.</word>
</text>
-<text id="het_2">
<word id="het_2_1">I</word>
<word id="het_2_2">'</word>
<word id="het_2_3">d</word>
<word id="het_2_4">like</word>
<word id="het_2_5">to</word>
<word id="het_2_6">be</word>
<word id="het_2_7">a</word>
<word id="het_2_8">Chinese</word>
<word id="het_2_9">laborer</word>
<word id="het_2_10">,</word>
<word id="het_2_11">said</word>
<word id="het_2_12">Tom</word>
<word id="het_2_13">coolly</word>
<word id="het_2_14">.</word>
</text>
</corpus>

1 answer

User

#1 · Posted on 2024-06-25 23:27:54

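The XML file is not well-formed as pasted. Delete the "-" that appears in front of some of the XML tags (for example, in front of <text>), save the result as a file, and then try the code below. The text of every word element, in document order and with duplicates kept, is collected in the list variable words.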

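import pprint as pp
import xml.etree.ElementTree as ET

# Parse the cleaned file (the copy with the stray "-" characters removed)
tree = ET.parse('XMLCorpus.xml')

# Collect the text of every <word> element in document order
words = []
for word_element in tree.iter('word'):
    words.append(word_element.text)

pp.pprint(words)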
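For NLP work it is often more useful to keep the tokens grouped by sentence instead of in one flat list. The following is a minimal sketch of that idea, not a definitive implementation. It assumes each <text> element is one sentence, that the file is UTF-8, and that the stray "-" characters only ever appear directly before an opening tag. The function names strip_tree_markers, sentences_from_xml and load_sentences are made up for this example.

```python
import re
import xml.etree.ElementTree as ET


def strip_tree_markers(raw_xml):
    """Remove a "-" that sits directly before a tag, at the start of a
    line or after whitespace. A "-" inside <word>-</word> is untouched
    because it follows ">" rather than whitespace."""
    return re.sub(r'(^|\s)-(?=<)', r'\1', raw_xml, flags=re.MULTILINE)


def sentences_from_xml(raw_xml):
    """Return a dict mapping each <text> id to its list of word tokens."""
    cleaned = strip_tree_markers(raw_xml)
    # Encode to bytes so the UTF-8 encoding declaration is honoured as-is
    root = ET.fromstring(cleaned.encode('utf-8'))
    sentences = {}
    for text_element in root.iter('text'):
        tokens = [w.text for w in text_element.iter('word') if w.text]
        sentences[text_element.get('id')] = tokens
    return sentences


def load_sentences(path):
    """Read a UTF-8 file (assumed encoding) and group its words by <text>."""
    with open(path, encoding='utf-8') as handle:
        return sentences_from_xml(handle.read())
```

Calling load_sentences('XMLCorpus.xml') on the original, uncleaned file should then give one token list per sentence id, and ' '.join(tokens) turns a list back into a plain string if a tokenizer or model expects raw text.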
