How do I match title-case phrases that contain prepositions?

Posted 2024-09-29 23:21:49


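I want to use a regex to extract title-case phrases from documents. The regular expression should match a phrase whether or not its prepositions are capitalized.

For example, I want it to match both of these: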

The Art of War

The Art Of War

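I have tried several regular expressions on Reddit comments, but I never get just the right phrases because I end up with a lot of false positives.

I tried this regular expression in Python: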

import regex
pattern = regex.compile(r"\b(?<!^)(?<=[A-Z]\w*\s?)(a(?:nd?)?|the|to|[io]n|from|with|of|for)(?!$)(?!\s?[a-z])\b|\b([A-Z]\w*)")
reddit_comment = "Honestly 'The Art of War' should be required reading in schools (outside of China), it has so much wisdom packed into it that is so sorely lacking in our current education system."
pattern.findall(reddit_comment)

I expected it to return 'The Art of War', but instead I got:

[('', 'Honestly'),
 ('', 'The'),
 ('', 'Art'),
 ('of', ''),
 ('', 'War'),
 ('', 'China')]


2 answers

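I don't think matching this with a regex alone is feasible.

You can use a package called NLTK to tokenize the text (nltk.word_tokenize) and get part-of-speech tags for the resulting tokens (nltk.pos_tag). That returns a list of tuples on which you can run your own custom business logic.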

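import nltk

# Assumes the NLTK tokenizer and POS-tagger data were already fetched with nltk.download().
text = r"Honestly 'The Art of War' should be required reading in schools (outside of China), it has so much wisdom packed into it that is so sorely lacking in our current education system."

# Split the text into word tokens.
tokens = nltk.word_tokenize(text)

# Tag each token with its part of speech; returns a list of (token, tag) tuples.
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)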

Output:

[
    ('Honestly', 'RB'), 
    ("'The", 'POS'), 
    ('Art', 'NNP'), 
    ('of', 'IN'), 
    ('War', 'NNP'), 
    ("'", 'POS'), 
    ('should', 'MD'), 
    ('be', 'VB'), 
    ('required', 'VBN'), 
    ('reading', 'NN'),
    ('in', 'IN'),
    ('schools', 'NNS'), 
    ('(', '('), 
    ('outside', 'IN'), 
    ('of', 'IN'), 
    ('China', 'NNP'), 
    (')', ')'), 
    (',', ','), 
    ('it', 'PRP'), 
    ('has', 'VBZ'), 
    ('so', 'RB'), 
    ('much', 'JJ'), 
    ('wisdom', 'NN'), 
    ('packed', 'VBD'), 
    ('into', 'IN'), 
    ('it', 'PRP'), 
    ('that', 'WDT'), 
    ('is', 'VBZ'), 
    ('so', 'RB'), 
    ('sorely', 'RB'), 
    ('lacking', 'VBG'), 
    ('in', 'IN'), 
    ('our', 'PRP$'), 
    ('current', 'JJ'), 
    ('education', 'NN'), 
    ('system', 'NN'), 
    ('.', '.')
]

Here 'IN' is the Penn Treebank tag for a preposition (or subordinating conjunction).
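As one possible shape for that "custom business logic", here is a minimal sketch that walks the (token, tag) tuples and joins proper nouns connected by prepositions. The helper name extract_title_candidates and the grouping rule are assumptions made for illustration; they are not part of NLTK. The input is a hand-copied slice of the output above, so the block runs without NLTK installed.

```python
# Hypothetical helper (not an NLTK API): group proper nouns (NNP/NNPS)
# that are joined by prepositions (IN) into candidate title phrases.

def extract_title_candidates(pos_tags):
    """Return phrases that start and end with a proper noun and may
    contain prepositions in between."""
    proper = {"NNP", "NNPS"}
    connectors = {"IN"}
    phrases, current = [], []

    def flush():
        # Drop trailing connectors so a phrase never ends on a preposition.
        while current and current[-1][1] in connectors:
            current.pop()
        if current:
            phrases.append(" ".join(tok for tok, _ in current))
        current.clear()

    for token, tag in pos_tags:
        if tag in proper:
            current.append((token, tag))
        elif tag in connectors and current:
            # A preposition only counts when it follows a proper noun.
            current.append((token, tag))
        else:
            flush()
    flush()
    return phrases


# A hand-copied slice of the tagger output shown above.
sample_tags = [
    ("Honestly", "RB"), ("'The", "POS"), ("Art", "NNP"), ("of", "IN"),
    ("War", "NNP"), ("'", "POS"), ("should", "MD"), ("be", "VB"),
    ("(", "("), ("outside", "IN"), ("of", "IN"), ("China", "NNP"), (")", ")"),
]

candidates = extract_title_candidates(sample_tags)
print(candidates)  # ['Art of War', 'China']
```

Note that the tagger labeled "'The" as POS because of the leading quote, so this sketch recovers "Art of War" rather than "The Art of War". Stripping quotes before tokenizing, or filtering out single-word candidates such as "China", would be further assumptions to tune against your own data.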

You can use

r'\b(?!^)[A-Z]\w*(?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+\b'

See the regex demo.

Details

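  • \b - a word boundary
  • (?!^) - a negative lookahead: the current position must not be the start of the string
  • [A-Z] - an uppercase letter
  • \w* - zero or more letters, digits, or underscores
  • (?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+ - one or more repetitions of the pattern in the non-capturing group:
    • \s+ - one or more whitespace characters
    • (?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*) - any of:
      • a(?:nd?)? - a, an, or and
      • |the|to| - or the, to
      • [io]n - in or on
      • |from|with|of|for| - or from, with, of, for
      • [A-Z]\w* - an uppercase letter followed by zero or more letters, digits, or underscores
  • \b - a word boundary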
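To see the pattern in action, here is a small sketch that runs it on the Reddit comment from the question. It uses the standard-library re module rather than the third-party regex package, which should be fine because this pattern contains no variable-length lookbehind. The name TITLE_PATTERN and the second test sentence are made up for illustration.

```python
import re

# The pattern from this answer, split over two string literals for readability.
TITLE_PATTERN = re.compile(
    r"\b(?!^)[A-Z]\w*"
    r"(?:\s+(?:a(?:nd?)?|the|to|[io]n|from|with|of|for|[A-Z]\w*))+\b"
)

reddit_comment = (
    "Honestly 'The Art of War' should be required reading in schools "
    "(outside of China), it has so much wisdom packed into it that is so "
    "sorely lacking in our current education system."
)

# No capturing groups are used, so findall returns the full matches as strings.
lowercase_prep = TITLE_PATTERN.findall(reddit_comment)
print(lowercase_prep)  # ['The Art of War']

# Hypothetical second sentence with a capitalized preposition.
uppercase_prep = TITLE_PATTERN.findall("I just finished The Art Of War yesterday.")
print(uppercase_prep)  # ['The Art Of War']
```

"Honestly" is skipped because of (?!^), and "China" is skipped because the group must repeat at least once, so a lone capitalized word never matches.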
