How to find proper nouns using spaCy NLP

Posted 2024-05-20 07:16:14


I am building a keyword extractor with spaCy. The keyword I am looking for is OpTic Gaming in the following text:

"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"

How can I parse OpTic Gaming out of this text? If I use noun chunks, I get "OpTic Gaming's main sponsors", and if I use tokens, I get ["OpTic", "Gaming", "'s"].

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of


1 answer
User
#1 · Posted 2024-05-20 07:16:14

spaCy assigns part-of-speech tags (proper noun, determiner, verb, etc.) for you. You can access them at the token level with token.pos_.

In your case:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for tok in doc:
    print(tok, tok.pos_)

...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...

You can then filter the proper nouns, group consecutive ones, and slice the doc to get the nominal groups:

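def extract_proper_nouns(doc):
    """Return a list of spans, one per run of consecutive proper nouns."""
    # Indices of every token tagged as a proper noun
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []  # completed runs of adjacent indices
    current = []       # run currently being built
    for elt in pos:
        if not current or current[-1] == elt - 1:
            # Start a run, or extend it when the index is adjacent
            current.append(elt)
        else:
            # Gap found: close the current run and start a new one
            consecutives.append(current)
            current = [elt]
    if current:
        consecutives.append(current)
    # Slice the doc from the first to the last index of each run
    return [doc[run[0]:run[-1] + 1] for run in consecutives]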

extract_proper_nouns(doc)

[OpTic Gaming, Duty Championship]
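The same grouping step can also be sketched with itertools.groupby from the standard library. This is only an illustrative variant, not part of the answer above: the helper name group_proper_nouns is hypothetical, and the input is a hand-written list of (text, tag) pairs standing in for spaCy's output so the snippet runs without a model. The tags for "one", "of", "OpTic" and "Gaming" are taken from the printed output above; the remaining tags are assumptions.

```python
from itertools import groupby

# Hand-written (text, pos) pairs standing in for spaCy's tagging.
# The first four tags match the printed output above; the rest are assumed.
tagged = [
    ("one", "NUM"),
    ("of", "ADP"),
    ("OpTic", "PROPN"),
    ("Gaming", "PROPN"),
    ("'s", "PART"),        # assumed tag
    ("main", "ADJ"),       # assumed tag
    ("sponsors", "NOUN"),  # assumed tag
]


def group_proper_nouns(pairs):
    """Join each run of consecutive PROPN tokens into one string."""
    groups = []
    # groupby yields one group per run of adjacent items sharing the same key
    for is_propn, run in groupby(pairs, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            groups.append(" ".join(text for text, _ in run))
    return groups


print(group_proper_nouns(tagged))  # ['OpTic Gaming']
```

With a real Doc, the pairs could be built as [(tok.text, tok.pos_) for tok in doc]. Note that this variant returns plain strings joined by single spaces, whereas slicing the doc as in extract_proper_nouns returns Span objects that keep the original whitespace and token attributes.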

For more details, see: https://spacy.io/usage/linguistic-features
