How do I "select" parts of a spaCy pattern match, rather than the whole match?

Posted 2024-09-28 21:43:38

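Rule-based pattern matching in spaCy returns a match ID plus the start and end of the matched span (as token indices into the Doc), but I don't see anything in the documentation that says how to determine which tokens in that span matched which part of the pattern.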

In a regex, I can put parentheses around groups to "select" them and pull them out of the pattern. Is something like that possible with spaCy?
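For comparison, here is a minimal sketch of the regex behaviour being described, using Python's standard `re` module. The pattern, the group names `adj` and `noun`, and the helper `pick_groups` are made up for illustration and are not part of the spaCy experiment below.

```python
import re

# Named groups mark the pieces we want to pull out of the overall match.
# The adjective is optional, so its group may come back as None.
PATTERN = re.compile(r"They wore (?:(?P<adj>\w+) )?(?P<noun>\w+)")

def pick_groups(text):
    """Return (adjective, noun) captured from the start of text, or None if no match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return m.group("adj"), m.group("noun")

print(pick_groups("They wore high boots, with their trousers tucked into them"))
print(pick_groups("They wore boots, with their trousers tucked into them"))
```

Each group reports which piece of the text it captured, even when an optional piece is absent. That is the information I would like to get back from the Matcher.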

For example, I have this piece of text (from Dracula):

They wore high boots, with their trousers tucked into them, and had long black hair and heavy black moustaches.

I set up an experiment:

import spacy
from spacy.matcher import Matcher

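def test_match(text, patterns):
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature; on spaCy v2 use: matcher.add('Boots', None, patterns)
    matcher.add('Boots', [patterns])

    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
        # start and end are token indices into doc, not character offsets
        match_id, start, end = match
        string_id = nlp.vocab.strings[match_id]  # 'Boots' (not used below)
        span = doc[start:end]
        print(match, span.text)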

text_a = "They wore high boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ'},
    {'TAG': 'NNS'}
]

test_match(text_a, patterns)

This outputs:

(18231591219755621867, 0, 4) They wore high boots

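For a simple pattern like this, with four tokens in a row, I can assume that token 0 is the pronoun, token 1 is the past-tense verb, and so on. But for patterns with quantifiers (the 'OP' key) it becomes ambiguous. Is it possible to have spaCy tell me which tokens actually matched which components of the pattern?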

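For example, consider this modification to the experiment above, where the pattern has two wildcards and a new version of the text lacks the adjective "high":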

text_b = "They wore boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'}
]

test_match(text_a, patterns)
print()
test_match(text_b, patterns)

Output:

(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore high
(18231591219755621867, 0, 4) They wore high boots

(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore boots

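In both outputs it is unclear which of the trailing tokens is the adjective and which is the plural noun. I suppose I could loop over the tokens in the span and manually match them against the parts of the pattern, but that is surely duplicated work. Since spaCy presumably has to find them in order to match them, can't it just tell me which is which?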
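To make the manual approach concrete, here is a rough sketch of what "looping over the span and re-matching it against the pattern" could look like. It is not how spaCy's Matcher works internally. The helper `align` is hypothetical, it only understands exact attribute equality plus the 'OP' values '*', '+' and '?', and the token dicts are hand-written stand-ins for what one would build from a real span with `{'POS': tok.pos_, 'TAG': tok.tag_}`.

```python
def align(tokens, pattern):
    """Return one pattern index per token, or None if the tokens do not fit.

    tokens:  list of dicts such as {'POS': 'ADJ', 'TAG': 'JJ'} (hypothetical stand-in
             for spaCy tokens)
    pattern: list of Matcher-style dicts, optionally with an 'OP' key
    """
    def fits(token, spec):
        # Every attribute in the pattern entry (except OP) must equal the token's.
        return all(token.get(k) == v for k, v in spec.items() if k != 'OP')

    def walk(t, p, used):
        # used: has pattern entry p already consumed at least one token?
        if p == len(pattern):
            return [] if t == len(tokens) else None
        spec = pattern[p]
        op = spec.get('OP', '1')
        # First try to consume token t with entry p (greedy).
        if t < len(tokens) and fits(tokens[t], spec):
            if op in ('*', '+'):
                rest = walk(t + 1, p, True)       # entry may repeat
            else:
                rest = walk(t + 1, p + 1, False)  # entry is done
            if rest is not None:
                return [p] + rest
        # Otherwise move on to the next entry, if this one may be left as it is.
        if op in ('*', '?') or (op == '+' and used):
            return walk(t, p + 1, False)
        return None

    return walk(0, 0, False)


pattern = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'},
]

# Hand-written tags for "They wore high boots" and "They wore boots".
tokens_a = [
    {'POS': 'PRON', 'TAG': 'PRP'},
    {'POS': 'VERB', 'TAG': 'VBD'},
    {'POS': 'ADJ', 'TAG': 'JJ'},
    {'POS': 'NOUN', 'TAG': 'NNS'},
]
tokens_b = [tokens_a[0], tokens_a[1], tokens_a[3]]

print(align(tokens_a, pattern))  # [0, 1, 2, 3]
print(align(tokens_b, pattern))  # [0, 1, 3]
```

In the second result the last token maps to pattern entry 3, so "boots" is identified as the plural noun and the adjective slot is empty. As far as I know, spaCy releases from 3.0.6 onward also document a `with_alignments` argument on the Matcher call that returns similar per-token pattern indices, so it is worth checking the Matcher documentation for your version before writing this by hand.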
