Chunking Stanford Named Entity Recognizer (NER) output from NLTK format

3 answers

User

Answer 1 · edited 2024-10-01 13:28:02

It looks long, but it does the job:

ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
# prev_tag must exist before the loop; otherwise a leading entity token raises NameError
chunked, prev_tag = [], ""
for word_pos in ner_output:
    word, pos = word_pos
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        # same entity tag as the previous token: extend the last tuple
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos

# merged tuples look like (w1, tag, w2, tag, ...): join the words, keep one tag
clean_chunked = [(" ".join(wordpos[::2]), wordpos[-1]) if len(wordpos) != 2 else wordpos for wordpos in chunked]

print(clean_chunked)

[out]:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]

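In more detail:

The first for-loop, which "remembers" the previous tag, produces something like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]

You will notice that every named entity ends up as a tuple with more than 2 items, while what you want is the words joined into a single element, i.e. 'Republican Party' from the merged tuple above. So you take the elements at even positions: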

>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]

Then you also notice that the last element of the entity tuple is the tag you want, so you do:

>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'
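
Putting the two slices together, a merged tuple can be collapsed into a single (phrase, tag) pair. This is only a minimal sketch; the helper name `collapse` is made up for illustration and does not appear in the answer's code:

```python
def collapse(merged):
    """Turn (w1, tag, w2, tag, ...) into ('w1 w2 ...', tag)."""
    words = merged[::2]   # even positions hold the words
    tag = merged[-1]      # the last item is the tag shared by the run
    return (" ".join(words), tag)


merged = ('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')
print(collapse(merged))
```

A plain 2-item tuple such as ('The', 'O') passes through unchanged, which is why the list comprehension in the answer only needs the length check as a shortcut.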

It is a bit ad hoc and long-winded, but I hope it helps. Here it is as a function, Merry Christmas:

ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]


def rechunk(ner_output):
    # prev_tag must exist before the loop; otherwise a leading entity token raises NameError
    chunked, prev_tag = [], ""
    for word_pos in ner_output:
        word, pos = word_pos
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            # same entity tag as the previous token: extend the last tuple
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos

    # merged tuples look like (w1, tag, w2, tag, ...): join the words, keep one tag
    clean_chunked = [(" ".join(wordpos[::2]), wordpos[-1])
                     if len(wordpos) != 2 else wordpos for wordpos in chunked]

    return clean_chunked


print(rechunk(ner_output))
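
For comparison, the same merging can be written with `itertools.groupby`, which groups consecutive items sharing a key. This is an alternative sketch, not the answer's method; the names `rechunk_groupby` and `ENTITY_TAGS` are assumptions introduced here:

```python
from itertools import groupby

ENTITY_TAGS = {'PERSON', 'ORGANIZATION', 'LOCATION'}


def rechunk_groupby(ner_output):
    """Merge runs of adjacent tokens that share the same entity tag."""
    chunked = []
    for tag, group in groupby(ner_output, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in ENTITY_TAGS:
            # One entry per entity run, words joined by spaces.
            chunked.append((" ".join(words), tag))
        else:
            # Non-entity tokens stay as individual (word, tag) pairs.
            chunked.extend((word, tag) for word in words)
    return chunked


ner_output = [('Remaking', 'O'), ('The', 'O'),
              ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(rechunk_groupby(ner_output))
```

Like the loop version, this merges any two adjacent tokens with the same tag, so two different people mentioned back to back would be fused into one PERSON chunk. Plain (word, tag) output does not carry enough information to tell them apart.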

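User

Answer 2 · edited 2024-10-01 13:28:02

You can use the standard NLTK way of representing chunks, nltk.Tree. This may mean that you have to change your representation a bit.

What I usually do is represent NER-tagged sentences as lists of triples: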

sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'), ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]

I do this when I use an external tool to NER-tag a sentence. Now you can convert this sentence to the NLTK representation:

^{pr2}$

This change of representation makes sense, because you will certainly need POS tags for NER tagging.

The end result should look like this:

(S
  (PERSON Andrew/NNP)
  is/VBZ
  part/NN
  of/IN
  the/DT
  (ORGANIZATION Republican/NNP Party/NNP)
  in/IN
  (LOCATION Dallas/NNP))
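
As a rough stand-in for the conversion step (not the answer's original code), the sketch below groups the (word, POS, NER) triples into labelled chunks and renders them in the bracketed layout shown above. It deliberately uses only the standard library; the function names `triples_to_chunks` and `render` are hypothetical. With NLTK installed, each (label, tokens) chunk could presumably be wrapped as `nltk.Tree(label, tokens)` under a root `nltk.Tree('S', ...)` instead of being rendered by hand.

```python
from itertools import groupby


def triples_to_chunks(sentence):
    """Group (word, pos, ner) triples into (label, [(word, pos), ...]) chunks.

    Tokens tagged 'O' are kept as plain (word, pos) pairs.
    """
    chunks = []
    for label, group in groupby(sentence, key=lambda tok: tok[2]):
        tokens = [(word, pos) for word, pos, _ in group]
        if label == 'O':
            chunks.extend(tokens)
        else:
            chunks.append((label, tokens))
    return chunks


def render(chunks):
    """Render chunks as a bracketed string with one child per line."""
    lines = []
    for chunk in chunks:
        if isinstance(chunk[1], list):      # (label, [(word, pos), ...])
            label, tokens = chunk
            inner = " ".join("%s/%s" % tok for tok in tokens)
            lines.append("  (%s %s)" % (label, inner))
        else:                               # plain (word, pos) pair
            lines.append("  %s/%s" % chunk)
    return "(S\n" + "\n".join(lines) + ")"


sentence = [('Andrew', 'NNP', 'PERSON'), ('is', 'VBZ', 'O'), ('part', 'NN', 'O'),
            ('of', 'IN', 'O'), ('the', 'DT', 'O'),
            ('Republican', 'NNP', 'ORGANIZATION'), ('Party', 'NNP', 'ORGANIZATION'),
            ('in', 'IN', 'O'), ('Dallas', 'NNP', 'LOCATION')]
print(render(triples_to_chunks(sentence)))
```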

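User

Answer 3 · edited 2024-10-01 13:28:02

This is actually coming in the next release of CoreNLP, under the name ^{}. However, it probably won't be available directly from NLTK unless the NLTK people want to support it alongside the standard Stanford NER interface.

In any case, for now you will have to either copy the code I linked to (which uses ^{} to do the dirty work) or write your own postprocessor in Python.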
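
If you go the "write your own postprocessor" route, one possible shape is a function that reports each entity as a span of token indices. This is only a sketch under assumptions: it is not a port of the CoreNLP code mentioned above, and the name `entity_spans` and its `outside` parameter are invented for illustration.

```python
def entity_spans(tagged, outside='O'):
    """Return (start, end, tag, text) for each run of equal non-outside tags.

    start and end are token indices into `tagged`; end is exclusive.
    """
    spans = []
    start, current = None, outside
    for i, (_, tag) in enumerate(tagged):
        if tag != current:
            if current != outside:
                spans.append((start, i, current))   # close the open run
            start, current = i, tag                 # open a new run
    if current != outside:
        spans.append((start, len(tagged), current))  # run reaching the end
    return [(s, e, t, " ".join(word for word, _ in tagged[s:e]))
            for s, e, t in spans]


tagged = [('Andrew', 'PERSON'), ('is', 'O'), ('part', 'O'), ('of', 'O'),
          ('the', 'O'), ('Republican', 'ORGANIZATION'),
          ('Party', 'ORGANIZATION'), ('in', 'O'), ('Dallas', 'LOCATION')]
for span in entity_spans(tagged):
    print(span)
```

Keeping the token offsets makes it easy to map an entity back to its position in the original sentence, which a flat list of merged strings cannot do.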
