Extracting a list of persons and organizations using the Stanford NER Tagger in NLTK

3 Answers

Answer 1 · edited 2024-06-02 10:26:59

Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:

Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicates where a person entity starts. The CRFClassifier class and feature factories support such labels, but they're not used in the models we currently distribute (as of 2012)

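You have the following options:

1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That's very easy, but it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It's a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools, or the NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.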

Edit: If all you want is to pull out runs of consecutive named entities (option 1 above), you should use itertools.groupby:

from itertools import groupby
for tag, chunk in groupby(netagged_words, lambda x:x[1]):
    if tag != "O":
        print("%-12s"%tag, " ".join(w for w, t in chunk))

If netagged_words is the list of (word, type) tuples in your question, this produces:

PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY

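Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.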
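Since the question asks for separate lists of persons and organizations, the groupby runs can also be collected into a dictionary keyed by tag. This is only a minimal sketch, assuming netagged_words has the (word, tag) form shown in the question; the helper name entities_by_type is hypothetical and not part of NLTK or the Stanford tools.

```python
from collections import defaultdict
from itertools import groupby

def entities_by_type(netagged_words):
    """Join runs of identically tagged tokens and bucket them by tag."""
    found = defaultdict(list)
    for tag, chunk in groupby(netagged_words, key=lambda pair: pair[1]):
        if tag != "O":  # skip tokens outside any named entity
            found[tag].append(" ".join(word for word, _ in chunk))
    return dict(found)

netagged_words = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
                  ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
                  ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
                  ('in', 'O'), ('NY', 'LOCATION')]

result = entities_by_type(netagged_words)
print(result["PERSON"])        # ['Rami Eid']
print(result["ORGANIZATION"])  # ['Stony Brook University']
```

The same caveat applies: adjacent entities of the same type end up as one string.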

Answer 2 · edited 2024-06-02 10:26:59

This doesn't print exactly what the topic author asked for, but maybe it will help:

listx = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
         ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
         ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

def parser(n, string):
    # Return the first element of the n-th (word, tag) pair that is not the tag itself.
    for i in listx[n]:
        if i == string:
            pass
        else:
            return i

name = parser(0, 'PERSON')
lname = parser(1, 'PERSON')
org1 = parser(5, 'ORGANIZATION')
org2 = parser(6, 'ORGANIZATION')
org3 = parser(7, 'ORGANIZATION')

print(name, lname)
print(org1, org2, org3)

The output should look like this:

Rami Eid
Stony Brook University

Answer 3 · edited 2024-06-02 10:26:59

IOB/BIO means Inside, Outside, Beginning (IOB), or sometimes Beginning, Inside, Outside (BIO).

The Stanford NE tagger returns per-token tags in an IOB/BIO-like style (without explicit B-/I- prefixes), e.g.

[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

('Rami', 'PERSON'), ('Eid', 'PERSON') are both tagged PERSON; "Rami" is the Beginning of an NE chunk and "Eid" is the Inside. Any non-NE token is tagged "O".
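For comparison, in an explicit BIO encoding every entity tag carries a B- or I- prefix, so chunks can be recovered by opening a new chunk at each B- tag. The following is a minimal sketch of that decoding; the BIO-tagged inputs are written by hand for illustration (they are not Stanford tagger output), and decode_bio is a hypothetical helper name.

```python
def decode_bio(bio_tagged):
    """Turn (token, BIO-tag) pairs into (entity string, label) pairs."""
    chunks = []
    tokens, label = [], None
    for token, tag in bio_tagged:
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk, closing any open one.
            if tokens:
                chunks.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # An I- tag with the same label continues the open chunk.
            tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk,
            # closes it; such stray I- tokens are dropped in this sketch.
            if tokens:
                chunks.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        chunks.append((" ".join(tokens), label))
    return chunks

bio_sent = [('Rami', 'B-PERSON'), ('Eid', 'I-PERSON'), ('is', 'O'),
            ('studying', 'O'), ('at', 'O'), ('Stony', 'B-ORGANIZATION'),
            ('Brook', 'I-ORGANIZATION'), ('University', 'I-ORGANIZATION'),
            ('in', 'O'), ('NY', 'B-LOCATION')]
decoded = decode_bio(bio_sent)
print(decoded)

# With B- markers, adjacent entities of the same type stay separate.
adjacent = decode_bio([('New', 'B-LOCATION'), ('York', 'I-LOCATION'),
                       ('Boston', 'B-LOCATION')])
print(adjacent)
```

The second call is the case that unprefixed tags cannot represent: without the B- marker on "Boston", the three tokens would look like a single LOCATION.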

The idea of extracting continuous NE chunks is very similar to Named Entity Recognition with Regular Expression: NLTK, but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this:

def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []

    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk: # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]

named_entities = get_continuous_chunks(ne_tagged_sent)
named_entities_str = [" ".join([token for token, tag in ne]) for ne in named_entities]
named_entities_str_tag = [(" ".join([token for token, tag in ne]), ne[0][1]) for ne in named_entities]

print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)
print()

[out]:

[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]

['Rami Eid', 'Stony Brook University', 'NY']

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]

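But note the limitation: if two NEs are directly adjacent, the result might be wrong. That said, I still can't think of an example where two NEs are adjacent without any "O" between them.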
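If adjacent entities of different types are a concern, one possible variation is to also close the current chunk whenever the tag changes. This is a sketch rather than part of the original answer: get_typed_chunks is a hypothetical name, and the example sentence and its tags are made up for illustration.

```python
def get_typed_chunks(tagged_sent):
    """Like get_continuous_chunks, but also split when the tag changes."""
    chunks = []
    tokens, current_tag = [], "O"
    for token, tag in tagged_sent:
        # Close the open chunk as soon as the tag differs from the previous one.
        if tag != current_tag and tokens:
            chunks.append((" ".join(tokens), current_tag))
            tokens = []
        if tag != "O":
            tokens.append(token)
        current_tag = tag
    # Flush the last chunk if the sentence ends inside an entity.
    if tokens:
        chunks.append((" ".join(tokens), current_tag))
    return chunks

# Hand-tagged example with a PERSON immediately followed by an ORGANIZATION.
example = [('Yesterday', 'O'), ('Rami', 'PERSON'), ('Eid', 'PERSON'),
           ('Google', 'ORGANIZATION'), ('offices', 'O')]
typed_chunks = get_typed_chunks(example)
print(typed_chunks)
# [('Rami Eid', 'PERSON'), ('Google', 'ORGANIZATION')]
```

Two adjacent entities of the same type are still merged, because the unprefixed tags carry no boundary information.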

As @alexis suggested, it's better to convert the Stanford NE output into NLTK trees:

from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree

def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O": #O
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O": # Begin NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag: # Inside NE
            bio_tagged_sent.append((token, "I-"+tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
            bio_tagged_sent.append((token, "B-"+tag))
            prev_tag = tag

    return bio_tagged_sent


def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]

    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree

ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
('in', 'O'), ('NY', 'LOCATION')]

ne_tree = stanfordNE2tree(ne_tagged_sent)

print(ne_tree)

[out]:

(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))

Then:

ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree): # If subtree is an NE chunk, i.e. NE != "O"
        ne_label = subtree.label()
        ne_string = " ".join([token for token, pos in subtree.leaves()])
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)

[out]:

[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
