How do I tokenize the period in titles, e.g. "Mr." rather than "Mr", and "It's" rather than "It" and "s"?

Posted 2024-06-01 19:40:16

I wrote this code based on the NLTK documentation:

import nltk
# define pattern
pattern = r'''(?x)      # set flag to allow verbose regexps (inline flag must open the pattern)
        (?:[A-Z]\.)+    # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, $12.40, 50%
        | \w+(?:-\w+)*  # words with internal hyphens
        | \.\.\.        # ellipsis
        |(?:Mr|Mrs|Dr|Ms)\.
        '''

sampletext = "Mr. Finch went to the bar but Dr. Liu wasn't there. It's o-k."
print(nltk.regexp_tokenize(sampletext, pattern))

Output:

['Mr', 'Finch', 'went', 'to', 'the', 'bar', 'but', 'Dr', 'Liu', 'wasn', 't', 'there', 'It', 's', 'o-k']

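What I actually want is for 'Mr.', 'Dr.', "It's" and so on to come out as single tokens. I tried \w+(?:'\[a-z])* to handle cases such as "can't" and "it's", but it doesn't work. Please help.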


1 answer
User
#1 · Posted 2024-06-01 19:40:16

The problem is this part of your regex:

        | \w+(?:-\w+)*  # words with internal hyphens

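It matches every ordinary word in the string, because the trailing group (?:-\w+)* is optional (the * allows zero repetitions). Alternatives are tried from left to right, so, for example, Mr is matched by this branch before the branch that matches Mr. is ever reached. You need to make those parts non-optional, and match plain words only at the very end of the regex (when every other alternative has failed). For example: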

import nltk

pattern = r'''(?x)              # set flag to allow verbose regexps (inline flag must open the pattern)
        (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
        | \$?\d+(?:\.\d+)?%?    # currency and percentages, $12.40, 50%
        | \w+(?:-\w+)+          # words with internal hyphens
        | \w+(?:'[a-z])         # words with apostrophes
        | \.\.\.                # ellipsis
        |(?:Mr|Mrs|Dr|Ms)\.     # honorifics
        | \w+                   # normal words
        '''

sampletext = "Mr. Finch went to the bar but Dr. Liu wasn't there. It's o-k."
print(nltk.regexp_tokenize(sampletext, pattern))

Output:

['Mr.', 'Finch', 'went', 'to', 'the', 'bar', 'but', 'Dr.', 'Liu', "wasn't", 'there', "It's", 'o-k']
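
To see the ordering effect in isolation, here is a minimal sketch that uses only the standard-library `re` module. It assumes that `re.findall` on a pattern without capturing groups yields the same tokens as `nltk.regexp_tokenize`. The names `TOKEN_PATTERN` and `tokenize` are made up for illustration. The pattern is a variant of the one above, not a replacement for it: the hyphen and apostrophe branches are merged into `\w+(?:[-']\w+)+`, so that contractions with more than one letter after the apostrophe (such as "they're") also stay in one piece.

```python
import re

# Alternation is ordered: at each position the first branch that matches wins.
general_first = re.compile(r"\w+|(?:Mr|Mrs|Dr|Ms)\.")
specific_first = re.compile(r"(?:Mr|Mrs|Dr|Ms)\.|\w+")

print(general_first.findall("Mr. Finch met Dr. Liu"))
# ['Mr', 'Finch', 'met', 'Dr', 'Liu']
print(specific_first.findall("Mr. Finch met Dr. Liu"))
# ['Mr.', 'Finch', 'met', 'Dr.', 'Liu']

# Hypothetical variant of the answer's pattern: specific branches first,
# the catch-all \w+ last.
TOKEN_PATTERN = re.compile(r"""(?x)   # verbose flag, must open the pattern
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, $12.40, 50%
    | (?:Mr|Mrs|Dr|Ms)\.      # honorifics
    | \w+(?:[-']\w+)+         # words with internal hyphens or apostrophes
    | \.\.\.                  # ellipsis
    | \w+                     # normal words (fallback)
""")


def tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Mrs. Smith said they're paying $12.40 for a well-known U.S.A. map..."))
# ['Mrs.', 'Smith', 'said', "they're", 'paying', '$12.40', 'for', 'a',
#  'well-known', 'U.S.A.', 'map', '...']
```

The sentence-final period is still dropped, as in the output above, because no branch matches a lone ".".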
