Problems using the spaCy tokenizer with special characters

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("pt_core_news_sm")
matcher = Matcher(nlp.vocab)
text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '
pattern_test = [{"TEXT": {"REGEX": "[0-9]+[,.]+[0-9]+[%]"}}]
text_ = nlp(text)
matcher.add("pattern test", [pattern_test])
result = matcher(text_)
for id_, beg, end in result:
    print(id_)
    print(text_[beg:end])

text_adjustment = text.replace(":", " ").replace("(", " ").replace(")", " ").replace("/", " ").replace(";", " ").replace("-", " ").replace("+", " ")
print([token for token in text_adjustment])

Output:

['t', 'o', 't', 'a', 'l', ' ', ' ', '1', ',', '8', '0', '%', ' ', ' ', 'c', 'o', 'm', 'e', 'x', ' ', '1', ',', '3', '0', '%', ' ', ' ', ' ', 'd', 'e', 'r', 'i', 'v', ' ', '0', ',', '5', '0', '%', ' ', 'a', 't', 'i', 'v', 'o', ' ', ' ', '1', ',', '1', '7', '%', ' ']

1 Answer

User

#1 · Posted on 2024-10-02 20:30:41

I suggest using

import re
#...
text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
pattern_test =  [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]

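Here, the regular expression (\S)([/:()]) matches any non-whitespace character (captured in group 1) followed by /, :, ( or ) (captured in group 2), and re.sub then inserts a space between the two groups.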
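To see what this substitution does to the sample text, here is a minimal sketch that uses only the standard library. Because re.sub matches do not overlap, the ( that directly follows %: stays attached to the colon. The lookbehind variant shown second is an alternative, not part of the answer above, and the names spaced and spaced_lookbehind are purely illustrative.

```python
import re

text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '

# Substitution from the answer: put a space between a non-whitespace
# character and a following /, :, ( or ).
spaced = re.sub(r'(\S)([/:()])', r'\1 \2', text)
print(repr(spaced))
# 'total : 1,80% :(comex 1,30% + deriv 0,50% /ativo : 1,17% '

# Alternative (an assumption, not from the answer): a lookbehind does not
# consume the preceding character, so consecutive punctuation such as ":("
# is separated as well.
spaced_lookbehind = re.sub(r'(?<=\S)([/:()])', r' \1', text)
print(repr(spaced_lookbehind))
# 'total : 1,80% : (comex 1,30% + deriv 0,50% /ativo : 1,17% '
```

In both cases each percentage ends up followed by whitespace, which is what lets the tokenizer emit the number and the % sign cleanly.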

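The regular expression ^\d+[,.]\d+$ matches a token whose entire text is a decimal number written with a comma or a period, and % is the text of the next token (because the tokenizer splits the number and the % into separate tokens).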
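The effect of the ^ and $ anchors can be checked without spaCy. The sketch below assumes the Matcher's REGEX operator applies the pattern to each token's text with search semantics (a match anywhere in the string), which is why the anchors are needed to require a whole-token match. The sample token strings are made up for illustration.

```python
import re

anchored = re.compile(r"^\d+[,.]\d+$")   # whole token must be a decimal number
unanchored = re.compile(r"\d+[,.]\d+")   # a decimal number anywhere in the token

# Hypothetical token texts, chosen to show where the two patterns differ.
candidates = ["1,80", "0.50", "1,80%", "v1,80x", "180"]

anchored_hits = [bool(anchored.search(tok)) for tok in candidates]
unanchored_hits = [bool(unanchored.search(tok)) for tok in candidates]

for tok, a, u in zip(candidates, anchored_hits, unanchored_hits):
    print(f"{tok!r}: anchored={a}, unanchored={u}")
```

Only "1,80" and "0.50" pass the anchored pattern, while the unanchored one also accepts "1,80%" and "v1,80x".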

Full Python code snippet:

import spacy, re
from spacy.matcher import Matcher

#nlp = spacy.load("pt_core_news_sm")
nlp = spacy.load("en_core_web_trf")
matcher = Matcher(nlp.vocab)
text = 'total: 1,80%:(comex 1,30% + deriv 0,50%/ativo: 1,17% '
text = re.sub(r'(\S)([/:()])', r'\1 \2', text)
pattern_test = [{"TEXT": {"REGEX": r"^\d+[,.]\d+$"}}, {"ORTH": "%"}]
text_ = nlp(text)

matcher.add("pattern test", [pattern_test] )
result = matcher(text_)

for id_, beg, end in result:
    print(id_)
    print(text_[beg:end])

Output:

9844711491635719110
1,80%
9844711491635719110
1,30%
9844711491635719110
0,50%
9844711491635719110
1,17%
