How do I tokenize titles together with their period, e.g. "Mr." instead of "Mr", and keep "It's" as one token instead of "It" and "s"?

import nltk

# define pattern
pattern = r'''(?x)              # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, $12.40, 50%
        | \w+(?:-\w+)*          # words with internal hyphens
        | \.\.\.                # ellipsis
        | (?:Mr|Mrs|Dr|Ms)\.
        '''

sampletext = "Mr. Finch went to the bar but Dr. Liu wasn't there. It's o-k."
print(nltk.regexp_tokenize(sampletext, pattern))

1 Answer

User

#1 · Posted 2024-06-01 19:40:16

Your problem is this part of your regex:

        | \w+(?:-\w+)*  # words with internal hyphens

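It matches every ordinary word in the string, because the (?:-\w+)* group is optional. Alternatives are tried from left to right, so (for example) Mr is matched by this branch before the regex ever reaches the branch that would match Mr. with its period. You need to adjust the regex so that these parts are no longer optional, and match plain words only at the very end of the regex, once every other alternative has failed. For example: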

import nltk

pattern = r'''(?x)              # set flag to allow verbose regexps
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, $12.40, 50%
        | \w+(?:-\w+)+          # words with internal hyphens
        | \w+(?:'[a-z])         # words with apostrophes
        | \.\.\.                # ellipsis
        |(?:Mr|Mrs|Dr|Ms)\.     # honorifics
        | \w+                   # normal words
        '''

sampletext = "Mr. Finch went to the bar but Dr. Liu wasn't there. It's o-k."
print(nltk.regexp_tokenize(sampletext, pattern))

Output:

['Mr.', 'Finch', 'went', 'to', 'the', 'bar', 'but', 'Dr.', 'Liu', "wasn't", 'there', "It's", 'o-k']
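
The ordering effect described above can be reproduced with the standard library alone. The following is a minimal sketch, not the answer's own code: it assumes that regexp_tokenize behaves like re.findall for a pattern without capturing groups, and the two reduced patterns and their variable names are made up for illustration.

```python
import re

text = "Mr. Finch went to the bar but Dr. Liu wasn't there. It's o-k."

# General alternative first: \w+ succeeds at every word,
# so the honorific branch never gets a chance and the period is dropped.
general_first = re.compile(r"\w+|(?:Mr|Mrs|Dr|Ms)\.")

# Specific alternative first: honorifics are tried before plain words.
specific_first = re.compile(r"(?:Mr|Mrs|Dr|Ms)\.|\w+")

tokens_general = general_first.findall(text)
tokens_specific = specific_first.findall(text)

print(tokens_general[:2])   # first two tokens with the general branch first
print(tokens_specific[:2])  # first two tokens with the specific branch first
```

The only difference between the two patterns is the order of the alternatives, which is why the catch-all \w+ branch belongs at the end of the full pattern.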
