How to create an NER pipeline with multiple models in spaCy

1 answer

Answer 1 · posted 2024-09-28 19:04:10

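The main problem is how to load and combine pipeline components so that they all use the same Vocab (nlp.vocab). A pipeline assumes that all of its components share one vocab; if they don't, you can get errors related to the StringStore.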
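As a rough illustration of the kind of StringStore error meant here (a minimal sketch, assuming spaCy is installed; the names store_a and store_b are made up for this example and are not part of the answer's code): a hash created in one StringStore cannot be resolved back to text by a different StringStore that has never seen the string.

```python
from spacy.strings import StringStore

# Two independent string stores, standing in for the stores of two
# separately loaded vocabs (hypothetical names, for illustration only).
store_a = StringStore()
store_b = StringStore()

# Adding a string returns its hash; the store that saw the string
# can map the hash back to the text.
key = store_a.add("Bremen")
assert store_a[key] == "Bremen"

# The other store has never seen the string, so the reverse lookup fails.
try:
    store_b[key]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(lookup_failed)
```

This is why a component that was loaded with one vocab should not be dropped directly into a pipeline that uses another.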

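You shouldn't try to combine pipeline components that were trained with different word vectors. As long as the vectors are the same, though, the question is just how to load components from separate models with the same vocab.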

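There is no way to do this with spacy.load(), so I think the simplest option is to initialize a new pipeline component with the required vocab and reload the existing component into it by temporarily serializing it.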

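To keep the demo short and use easily accessible models, I'll show how to add the German NER model from de_core_news_sm to the English model en_core_web_sm, even though that's not something you'd typically want to do: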

import spacy # tested with v2.2.3
from spacy.pipeline import EntityRecognizer

text = "Jane lives in Boston. Jan lives in Bremen."

# load the English and German models
nlp_en = spacy.load('en_core_web_sm')  # NER tags PERSON, GPE, ...
nlp_de = spacy.load('de_core_news_sm') # NER tags PER, LOC, ...

# the Vocab objects are not the same
assert nlp_en.vocab != nlp_de.vocab

# but the vectors are identical (because neither model has vectors)
assert nlp_en.vocab.vectors.to_bytes() == nlp_de.vocab.vectors.to_bytes()

# original English output
doc1 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc1.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Bremen', 'GPE')]

# original German output (the German model makes weird predictions for English text)
doc2 = nlp_de(text)
print([(ent.text, ent.label_) for ent in doc2.ents])
# [('Jane lives', 'PER'), ('Boston', 'LOC'), ('Jan lives', 'PER'), ('Bremen', 'LOC')]

# initialize a new NER component with the vocab from the English pipeline
ner_de = EntityRecognizer(nlp_en.vocab)

# reload the NER component from the German model by serializing
# without the vocab and deserializing using the new NER component
ner_de.from_bytes(nlp_de.get_pipe("ner").to_bytes(exclude=["vocab"]))

# add the German NER component to the end of the English pipeline
nlp_en.add_pipe(ner_de, name="ner_de")

# check that they have the same vocab
assert nlp_en.vocab == ner_de.vocab

# combined output (English NER runs first, German second)
doc3 = nlp_en(text)
print([(ent.text, ent.label_) for ent in doc3.ents])
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]

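spaCy's NER components (EntityRuler and EntityRecognizer) are designed to preserve any existing entities, so the new component only adds 'Jan lives' with the German NER tag PER and leaves all other entities as predicted by the English NER.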
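The effect on the final entity list can be approximated in plain Python. This is only a simplified sketch, not spaCy's actual implementation: it assumes hand-written token offsets for the example sentence, it takes the standalone German predictions printed earlier as the second component's candidates, and the helper names (overlaps, add_without_overwriting) are invented for this illustration. In the real pipeline the second NER predicts with the existing entities already fixed, so its candidates can differ.

```python
# Simplified illustration only: this is NOT how spaCy implements it.
# Spans are (start_token, end_token_exclusive, label) tuples.

def overlaps(a, b):
    """Return True if two token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def add_without_overwriting(existing, predicted):
    """Keep every existing span; add only predicted spans that don't overlap one."""
    merged = list(existing)
    for span in predicted:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)

# Hand-written tokenization of the demo text (an assumption for this sketch).
tokens = ["Jane", "lives", "in", "Boston", ".", "Jan", "lives", "in", "Bremen", "."]

# Entities from the English NER, which runs first.
english = [(0, 1, "PERSON"), (3, 4, "GPE"), (8, 9, "GPE")]
# Standalone German predictions, used here as the second component's candidates.
german = [(0, 2, "PER"), (3, 4, "LOC"), (5, 7, "PER"), (8, 9, "LOC")]

combined = add_without_overwriting(english, german)
result = [(" ".join(tokens[start:end]), label) for start, end, label in combined]
print(result)
# [('Jane', 'PERSON'), ('Boston', 'GPE'), ('Jan lives', 'PER'), ('Bremen', 'GPE')]
```

Only the 'Jan lives' span survives from the German side because every other German candidate overlaps an entity that the English NER had already set.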

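You can use the options of add_pipe() to control where the component is inserted in the pipeline. To add the German NER before the default English NER, pass before="ner":

# use this in place of the add_pipe() call above
nlp_en.add_pipe(ner_de, name="ner_de", before="ner")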

All of the add_pipe() options are described in the docs: https://spacy.io/api/language#add_pipe

You can save the extended pipeline to disk as a single model, so that next time you can load it in one line with spacy.load():

nlp_en.to_disk("/path/to/model")
nlp_reloaded = spacy.load("/path/to/model")
print(nlp_reloaded.pipe_names) # ['tagger', 'parser', 'ner', 'ner_de']
