I am trying to add stock symbols to the strings that get recognized as ORG entities. For each symbol, I do:
nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])
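I can see that the symbol gets added to the matcher's patterns, but symbols that were not recognized before being added are still not recognized afterwards. Apparently these tokens already exist in the vocabulary (which is why the vocab length does not change).
What should I do? What am I missing?
Thanks
Here is my sample code: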
"""Short snippet to practice adding stock ticker symbols as ORG entities"""
from spacy.en import English
import spacy.en
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
import os
import csv
import sys

nlp = English()  # Load everything for the English model
print "Before nlp vocab length", len(nlp.matcher.vocab)

symbol_list = [u"CHK", u"JONE", u"NE", u"DO", u"ESV"]
txt = u"""drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: NE), (NYSE: DO), (NYSE: ESV), (NYSE: JONE)"""# u"""Drive double-digit rallies in Chesapeake Energy (NYSE: CHK), Noble Corporation (NYSE:NE), Diamond Offshore (NYSE:DO), Ensco (NYSE:ESV), and Jones Energy (NYSE: JONE)"""

before = nlp(txt)
for tok in before:  # Before adding entities
    print tok, tok.orth, tok.tag_, tok.ent_type_

# Register one single-token pattern per ticker symbol
for symbol in symbol_list:
    print "adding symbol:", symbol
    print "vocab length:", len(nlp.matcher.vocab)
    print "pattern length:", nlp.matcher.n_patterns
    nlp.matcher.add(symbol, u'ORG', {}, [[{u'orth': symbol}]])

print "Patterns:", nlp.matcher._patterns
print "Entities:", nlp.matcher._entities
for ent in nlp.matcher._entities:
    print ent.label

# Re-run the pipeline after the patterns have been added
tokens = nlp(txt)
print "\n\nAfter:"
print "After nlp vocab length", len(nlp.matcher.vocab)
for tok in tokens:
    print tok, tok.orth, tok.tag_, tok.ent_type_
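With a working example based on the docs, NYSE and ESV are now tagged with the STOCK entity type. Basically, on each match you have to merge the tokens and/or assign the entity type you want yourself. There is also an acceptor function that lets you filter or reject matches as they are found.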
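The idea described here (look up a token by its exact text, assign the entity type yourself in a callback, and optionally veto the match with an acceptor) can be sketched without spaCy at all. What follows is a minimal, library-free illustration and not spaCy's API: `Token`, `TinyMatcher`, `assign_label`, `after_exchange`, and `tag_symbols` are hypothetical names invented for this sketch, and the regex tokenizer is only a crude stand-in for a real one.

```python
import re


class Token(object):
    """A token with its exact text (what spaCy calls ORTH) and an entity type."""

    def __init__(self, text):
        self.text = text
        self.ent_type = ""  # empty string means "no entity assigned"


def tokenize(text):
    # Crude stand-in for a real tokenizer: words and single punctuation marks.
    return [Token(t) for t in re.findall(r"\w+|[^\w\s]", text)]


class TinyMatcher(object):
    """Hypothetical single-token matcher keyed on exact token text."""

    def __init__(self):
        self._patterns = {}  # exact text -> (label, acceptor, on_match)

    def add(self, orth, label, acceptor=None, on_match=None):
        self._patterns[orth] = (label, acceptor, on_match)

    def __call__(self, tokens):
        matches = []
        for i, tok in enumerate(tokens):
            if tok.text not in self._patterns:
                continue
            label, acceptor, on_match = self._patterns[tok.text]
            # The acceptor may reject a candidate before it counts as a match.
            if acceptor is not None and not acceptor(tokens, i):
                continue
            matches.append((label, i))
            # Finding a match changes nothing by itself; the callback does.
            if on_match is not None:
                on_match(tokens, i, label)
        return matches


def assign_label(tokens, i, label):
    # Explicitly write the entity type onto the matched token.
    tokens[i].ent_type = label


def after_exchange(tokens, i):
    # Accept a symbol only when it directly follows "NYSE" and ":".
    return i >= 2 and tokens[i - 1].text == ":" and tokens[i - 2].text == "NYSE"


def tag_symbols(text, symbols, label="STOCK"):
    matcher = TinyMatcher()
    for symbol in symbols:
        matcher.add(symbol, label, acceptor=after_exchange, on_match=assign_label)
    tokens = tokenize(text)
    matcher(tokens)
    return [(t.text, t.ent_type) for t in tokens if t.ent_type]


if __name__ == "__main__":
    txt = "drive double-digit rallies in Chesapeake Energy (NYSE: CHK), (NYSE: DO), (NYSE: ESV)"
    print(tag_symbols(txt, ["CHK", "DO", "ESV"]))
```

In this sketch, registering a pattern alone never changes a token; only the `on_match` callback writes the label. The acceptor is what keeps an ambiguous symbol such as `DO` from being tagged when it appears as an ordinary word rather than after `NYSE:`.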