spaCy's regex behaves differently from Python's regex

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)

# Intended to match times such as "5am" or "9pm"
pattern = [{'TEXT': {'REGEX': r'\d{1}[a|p]m'}}]
matcher.add('TIME', None, pattern)

doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.sent.text)

1 Answer

User

#1 · Posted 2024-10-02 16:34:32

Keep in mind that spaCy tokenizes the digit and the letters as separate tokens here. A quick test:

doc = nlp("1pm")
print([token.text for token in doc]) # => ['1', 'pm']

According to the spaCy docs:

If spaCy’s tokenization doesn’t match the tokens defined in a pattern, the pattern is not going to produce any results.
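To make the quoted point concrete, here is a minimal sketch in plain Python. It does not call spaCy: the token list is hard-coded from the test output above, and the per-token loop is only an approximation of how the Matcher tries a `REGEX` against each token's text on its own.

```python
import re

# The asker's regex, written as a raw string.
PATTERN = re.compile(r"\d{1}[a|p]m")

text = "1pm"
# Token texts copied from the test above; hard-coded here instead of calling spaCy.
tokens = ["1", "pm"]

# Plain Python: the regex sees the whole string, so it matches.
whole_string_hit = PATTERN.search(text) is not None

# Matcher-style: the regex is tried against one token text at a time.
per_token_hits = [tok for tok in tokens if PATTERN.search(tok)]

print(whole_string_hit)  # True
print(per_token_hits)    # []
```

Neither `'1'` nor `'pm'` contains a digit followed by `am` or `pm`, so a single-token pattern built from this regex has nothing to match.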

You need to use rule-based matching to define your own entity, with one pattern entry per token:

pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]

Then add it to the matcher:

matcher.add('TIME', None, pattern)  # spaCy v2 signature; spaCy v3 expects matcher.add('TIME', [pattern])

Find the matches:

matches = matcher(doc)  # list of (match_id, start, end) tuples
for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(span.text)

Full demo:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

text = 'Monday to Friday 12 midnight to 5am 30% . Midnight Friday to 6am Saturday 30% . 9pm Saturday to Midnight Saturday 25% . Midnight Saturday to 6am Sunday 100% . 6am Sunday to 9pm Sunday 50%'
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}, {'LOWER': {'REGEX' : '^[ap]m$'}}]
# spaCy v2 signature; on spaCy v3 use matcher.add('TIME', [pattern])
matcher.add('TIME', None, pattern)

matches = matcher(doc)
print([doc[start:end] for match_id, start, end in matches])
#=> [5am, 6am, 9pm, 6am, 6am, 9pm]
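For readers who want to see what the two-token pattern checks without installing a model, the logic can be sketched in plain Python. This is only an approximation, not spaCy's implementation: `find_times` is a hypothetical helper, `str.isdigit()` is a rough stand-in for `LIKE_NUM` (which accepts more forms than plain digits), and the token list is hand-written to mimic the tokenization shown above.

```python
import re

AM_PM = re.compile(r"^[ap]m$")


def find_times(tokens):
    """Return (start, end) index pairs where a digit token is followed by am/pm."""
    spans = []
    for i in range(len(tokens) - 1):
        number_like = tokens[i].isdigit()              # rough stand-in for LIKE_NUM
        am_pm = AM_PM.search(tokens[i + 1].lower())    # mirrors LOWER + REGEX
        if number_like and am_pm:
            spans.append((i, i + 2))
    return spans


# Hand-tokenized (an assumption) the way the answer shows "5am" splitting into "5", "am".
tokens = ["12", "midnight", "to", "5", "am", "30", "%", ".", "9", "PM", "Saturday"]
spans = find_times(tokens)
print([" ".join(tokens[s:e]) for s, e in spans])  # ['5 am', '9 PM']
```

Lowercasing before applying the regex is why `PM` still matches, which is the reason the answer uses `LOWER` rather than `TEXT`.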
