How do I match title-case phrases with prepositions?

import regex

pattern = regex.compile(r"\b(?<!^)(?<=[A-Z]\w*\s?)(a(?:nd?)?|the|to|[io]n|from|with|of|for)(?!$)(?!\s?[a-z])\b|\b([A-Z]\w*)")
reddit_comment = "Honestly 'The Art of War' should be required reading in schools (outside of China), it has so much wisdom packed into it that is so sorely lacking in our current education system."
pattern.findall(reddit_comment)

2 Answers

User

Answer 1 · edited 2024-09-29 23:21:49

I don't think matching this with a regex is feasible.

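You can use a package called NLTK to tokenize the text and get part-of-speech tags for the resulting tokens. It returns a list of tuples on which you can run your own business logic.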

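import nltk

# Requires NLTK's tokenizer and tagger data, installed via nltk.download();
# the exact resource names depend on the NLTK version.
text = "Honestly 'The Art of War' should be required reading in schools (outside of China), it has so much wisdom packed into it that is so sorely lacking in our current education system."

tokens = nltk.word_tokenize(text)  # split the text into word tokens
pos_tags = nltk.pos_tag(tokens)    # list of (token, tag) tuples

print(pos_tags)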

Output:

[
    ('Honestly', 'RB'), 
    ("'The", 'POS'), 
    ('Art', 'NNP'), 
    ('of', 'IN'), 
    ('War', 'NNP'), 
    ("'", 'POS'), 
    ('should', 'MD'), 
    ('be', 'VB'), 
    ('required', 'VBN'), 
    ('reading', 'NN'),
    ('in', 'IN'),
    ('schools', 'NNS'), 
    ('(', '('), 
    ('outside', 'IN'), 
    ('of', 'IN'), 
    ('China', 'NNP'), 
    (')', ')'), 
    (',', ','), 
    ('it', 'PRP'), 
    ('has', 'VBZ'), 
    ('so', 'RB'), 
    ('much', 'JJ'), 
    ('wisdom', 'NN'), 
    ('packed', 'VBD'), 
    ('into', 'IN'), 
    ('it', 'PRP'), 
    ('that', 'WDT'), 
    ('is', 'VBZ'), 
    ('so', 'RB'), 
    ('sorely', 'RB'), 
    ('lacking', 'VBG'), 
    ('in', 'IN'), 
    ('our', 'PRP$'), 
    ('current', 'JJ'), 
    ('education', 'NN'), 
    ('system', 'NN'), 
    ('.', '.')
]

Here, 'IN' is the tag for a preposition (in the Penn Treebank tag set it also covers subordinating conjunctions).
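The "custom business logic" step is left open in this answer. As one possible sketch, the function below walks the (token, tag) tuples and joins runs that start with a proper noun and may contain connector words in between. The function name title_phrases, the tag sets PROPER and CONNECTOR, and the rule that a phrase must end on a proper noun are all assumptions made for illustration, not part of NLTK. The sample input is copied from the start of the output above, so the sketch runs without NLTK installed.

```python
# Hypothetical helper: group POS-tagged tokens into title-like phrases.
PROPER = {"NNP", "NNPS"}             # proper nouns (Penn Treebank tags)
CONNECTOR = {"IN", "DT", "CC", "TO"}  # assumed set of allowed in-between words


def title_phrases(tagged):
    """Return runs that start and end on a proper noun, allowing connectors inside."""
    phrases = []
    run = []          # tokens of the phrase currently being built
    last_proper = 0   # length of `run` up to and including the last proper noun
    for token, tag in tagged:
        if tag in PROPER:
            run.append(token)
            last_proper = len(run)
        elif tag in CONNECTOR and run:
            run.append(token)  # keep it only if a proper noun follows later
        else:
            if run:
                phrases.append(" ".join(run[:last_proper]))
            run, last_proper = [], 0
    if run:
        phrases.append(" ".join(run[:last_proper]))
    return phrases


# First part of the tagger output shown above.
sample = [
    ("Honestly", "RB"), ("'The", "POS"), ("Art", "NNP"), ("of", "IN"),
    ("War", "NNP"), ("'", "POS"), ("should", "MD"), ("be", "VB"),
    ("required", "VBN"), ("reading", "NN"), ("in", "IN"), ("schools", "NNS"),
    ("(", "("), ("outside", "IN"), ("of", "IN"), ("China", "NNP"), (")", ")"),
]

print(title_phrases(sample))  # ['Art of War', 'China']
```

Because the tagger attached the opening quote to 'The and labelled it 'POS', the leading article is lost here. Any real use would need to handle such tokenization quirks separately.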

User

Answer 2 · edited 2024-09-29 23:21:49

You can use

r'\b(?!^)[A-Z]\w*(?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+\b'

See the regex demo.

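Details

\b - a word boundary
(?!^) - a negative lookahead: the current position must not be the start of the string
[A-Z] - an uppercase letter
\w* - 0+ letters, digits, or underscores
(?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+ - one or more repetitions of the pattern inside the non-capturing group:
- \s+ - 1+ whitespace characters
- (?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*) - any of
  - a(?:nd?)? - a, an, and
  - |the|to| - or the, or to, or
  - [io]n - in or on
  - |from|with|of|for| - or from, or with, or of, or for
  - [A-Z]\w* - an uppercase letter followed by 0+ letters, digits, or underscores
\b - a word boundary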
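To try the pattern against the comment from the question, here is a minimal sketch. It uses Python's built-in re module, which is an assumption: the question uses the third-party regex package, but this pattern only needs lookaheads and non-capturing groups, which re supports. The names TITLE_PATTERN and find_titles are illustrative.

```python
import re

# The pattern from this answer, unchanged.
TITLE_PATTERN = re.compile(
    r'\b(?!^)[A-Z]\w*(?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+\b'
)


def find_titles(text):
    """Return every full match; the pattern has no capturing groups."""
    return TITLE_PATTERN.findall(text)


reddit_comment = (
    "Honestly 'The Art of War' should be required reading in schools "
    "(outside of China), it has so much wisdom packed into it that is so "
    "sorely lacking in our current education system."
)

print(find_titles(reddit_comment))  # ['The Art of War']

# (?!^) rejects a match at the very start of the string, so a title that
# opens the text is skipped.
print(find_titles("War and Peace is a long novel."))  # []
```

A single capitalized word such as "China" is not returned, because the group must repeat at least once: the match needs a second word after the first capitalized one.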
