I want to create a matrix or list of tags from an XML file.
For example, given the following XML:
<?xml version="1.0" encoding="UTF-8"?>
Words about John.
<person Key="NameHash:001">John Master-Smith</person> is a coder.
<person Key="NameHash:002">Aleksandra</person> likes goldfish.
She likes to dance.
I can retrieve the text with Python:
from bs4 import BeautifulSoup
with open('test.txt.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')
text = soup.get_text()
This returns:
Words about John.
John Master-Smith is a coder.
Aleksandra likes goldfish.
She likes to dance.
After this, I split the text into tokens:
[['Words', 'about', 'John']
['John', 'Master-Smith', 'is', 'a', 'coder']
['Aleksandra', 'likes', 'goldfish']
['She', 'likes', 'to', 'dance']]
What I want to do is map these tokens to an array that indicates whether each token is inside a tag.
For example, I would like to return:
[[None, None, None]
['NameHash:001', 'NameHash:001', None, None, None]
['NameHash:002', None, None]
[None, None, None, None]]
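Does anyone know how to do this?
I have already tried the following, but unfortunately I want to know whether the text in the original XML really sits inside a tag, not merely whether some user-given string occurs inside a tag in the XML.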
with open('test.txt.xml') as f:
    soup = BeautifulSoup(f, 'html.parser')
text = soup.get_text()
tag_text = [x.string for x in soup.find_all('person')]
# split into word tokens with specific tokenizer
sentences = tokenize(text)
tag_mask_all_sentences = []
for sentence in sentences:
    print(sentence)
    sentence_mask = []
    for word in sentence:
        found = False
        for s in tag_text:
            if word in s:
                sentence_mask.append(1)
                found = True
        if not found:
            sentence_mask.append(None)
    tag_mask_all_sentences.append(sentence_mask)
for tag_mask in tag_mask_all_sentences:
    print(tag_mask)
This returns:
['Words', 'about', 'John']
['John', 'Master-Smith', 'is', 'a', 'coder']
['Aleksandra', 'likes', 'goldfish']
['She', 'likes', 'to', 'dance']
[None, None, 1]
[1, 1, None, 1, 1, None]
[1, None, None]
[None, None, None, None]
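As you can see, this is not quite right, because the 'John' in the first sentence is not inside a tag. I also really don't know what is going on with 'is' and 'a'... I think it is because the code finds those characters somewhere in the tag text, which is obviously very wrong.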
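The stray values appear to come from `word in s` being a substring test on strings, combined with one `append` for every tag string that matches. A small check that reuses the tag strings from the example (the names `matches` and `mask` are only for illustration) suggests that 'is' never matches at all, while 'a' matches both names and so contributes two entries:

```python
tag_text = ["John Master-Smith", "Aleksandra"]
sentence = ["John", "Master-Smith", "is", "a", "coder"]

# For each word, list the tag strings that contain it as a substring
matches = {word: [s for s in tag_text if word in s] for word in sentence}
print(matches["is"])  # []
print(matches["a"])   # ['John Master-Smith', 'Aleksandra']

# Rebuild the mask the same way the attempt does: one 1 per matching string
mask = []
for word in sentence:
    mask.extend([1] * len(matches[word]) or [None])
print(mask)  # [1, 1, None, 1, 1, None]
```

That is why the row has six entries for five tokens, and why the first sentence's 'John' is flagged: the string 'John' occurs in 'John Master-Smith', regardless of where the token came from in the XML.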
Ultimately, a simpler way to describe the output I need is like this: w is a word, 0 is padding, . is untagged, and T is tagged.
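One possible way to get both the per-token mask and a padded view like this is to track tag state while parsing, instead of searching for strings afterwards. The following is a minimal sketch under several assumptions, not a definitive implementation: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` (the parser behind BeautifulSoup's `'html.parser'` option); `TagMaskParser`, `tag_mask`, `render`, and the whitespace tokenizer are hypothetical stand-ins for the real tokenizer; it assumes one sentence per line and no nested `person` elements; and the padded layout is only a guess at the intended T / . / 0 grid.

```python
from html.parser import HTMLParser

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
Words about John.
<person Key="NameHash:001">John Master-Smith</person> is a coder.
<person Key="NameHash:002">Aleksandra</person> likes goldfish.
She likes to dance.
"""


class TagMaskParser(HTMLParser):
    """Record every token together with the Key of its enclosing <person>."""

    def __init__(self):
        super().__init__()
        self.current_key = None  # Key of the open <person>, or None outside one
        self.tokens = [[]]       # one token list per line of text
        self.masks = [[]]        # parallel rows holding a Key or None per token

    def handle_starttag(self, tag, attrs):
        if tag == "person":
            # html.parser lowercases attribute names: "Key" arrives as "key"
            self.current_key = dict(attrs).get("key")

    def handle_endtag(self, tag):
        if tag == "person":
            self.current_key = None

    def handle_data(self, data):
        for i, line in enumerate(data.split("\n")):
            if i > 0:  # a newline was crossed, so start a new row
                self.tokens.append([])
                self.masks.append([])
            # Naive tokenizer (assumption): drop full stops, split on whitespace
            for token in line.replace(".", " ").split():
                self.tokens[-1].append(token)
                self.masks[-1].append(self.current_key)


def tag_mask(xml_text):
    """Return (tokens, masks) with empty rows removed."""
    parser = TagMaskParser()
    parser.feed(xml_text)
    parser.close()
    rows = [(t, m) for t, m in zip(parser.tokens, parser.masks) if t]
    return [t for t, _ in rows], [m for _, m in rows]


def render(masks):
    """Pad rows to equal width: T = tagged, . = untagged, 0 = padding."""
    width = max(len(row) for row in masks)
    return [
        " ".join(["T" if key else "." for key in row] + ["0"] * (width - len(row)))
        for row in masks
    ]


tokens, masks = tag_mask(SAMPLE)
for row in masks:
    print(row)
for line in render(masks):
    print(line)
```

Because each token is labelled at the moment it is read, the 'John' in the first sentence stays untagged even though the same string appears inside a `person` element later. With a different tokenizer, the same idea should carry over as long as tokenization happens inside `handle_data`, so that every token inherits `current_key`.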