How are vectors averaged when merging custom noun chunks (using retokenize) in spaCy?

def tokenise_noun_chunks(doc):
    # Without a dependency parse there are no noun chunks to merge
    if not doc.has_annotation("DEP"):
        return doc
    all_noun_chunks = list(doc.noun_chunks) + doc._.custom_noun_chunks
    with doc.retokenize() as retokenizer:
        for span in all_noun_chunks:
            # if I print(span.vector) here, I get the correctly averaged vector
            attrs = {"tag": span.root.tag, "dep": span.root.dep}
            retokenizer.merge(span, attrs=attrs)
    return doc

1 answer

User

#1 · Posted 2024-05-20 02:31:53

The retokenizer should set span.vector as the vector of the newly merged token. With spacy==3.0.3 and en_core_web_md==3.0.0:

import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("This is a sentence.")
with doc.retokenize() as retokenizer:
    for chunk in doc.noun_chunks:
        retokenizer.merge(chunk)
for token in doc:
    print(token, token.vector[:5])

Output:

This [-0.087595  0.35502   0.063868  0.29292  -0.23635 ]
is [-0.084961   0.502      0.0023823 -0.16755    0.30721  ]
a sentence [-0.093156   0.1371495 -0.307255   0.2993     0.1383735]
. [ 0.012001  0.20751  -0.12578  -0.59325   0.12525 ]
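
To check the averaging without downloading a model, here is a minimal sketch that uses a bare Vocab with made-up 3-dimensional vectors. The values and the toy_vectors name are assumptions for illustration only and have nothing to do with en_core_web_md. It relies on the merged token receiving span.vector, as in the run shown here:

```python
import numpy as np
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Toy 3-dimensional vectors (made-up values, for illustration only).
toy_vectors = {
    "This": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "a": [0.0, 2.0, 4.0],
    "sentence": [2.0, 4.0, 0.0],
    ".": [0.0, 0.0, 1.0],
}

vocab = Vocab()
for word, values in toy_vectors.items():
    vocab.set_vector(word, np.asarray(values, dtype="float32"))

doc = Doc(vocab, words=["This", "is", "a", "sentence", "."])

# Span.vector defaults to the average of the token vectors.
# Copy it before merging, because the span is no longer valid afterwards.
span = doc[2:4]
expected = np.array(span.vector, dtype="float32")

with doc.retokenize() as retokenizer:
    retokenizer.merge(span)

merged = doc[2]
print(merged.text, merged.vector)
print(np.allclose(merged.vector, expected))
```

If the last line prints True, the merged token carries the mean of the "a" and "sentence" vectors.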

Attributes like tag and dep are also set to those of span.root by default, so you only need to specify them if you want to override the defaults.
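
As a rough sketch of overriding attributes during a merge: the "NN" tag and the lemma below are arbitrary values picked for illustration, and the Doc is built by hand without a pipeline, so treat it as a demonstration of the attrs argument rather than of real tagger output:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built Doc with no tagger or parser (illustrative setup only).
doc = Doc(Vocab(), words=["This", "is", "a", "sentence", "."])

with doc.retokenize() as retokenizer:
    # Values passed in attrs are applied to the merged token.
    retokenizer.merge(doc[2:4], attrs={"TAG": "NN", "LEMMA": "a sentence"})

merged = doc[2]
print(merged.text, merged.tag_, merged.lemma_)
```

Leaving out attrs keeps the defaults taken from span.root.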
