Python: Split text into individual English sentences; keep the punctuation

Posted 2024-09-24 02:23:19


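I am trying to write a function that takes a string/text as its argument and returns a list of the sentences in the text. Sentence boundaries such as (., ?, !) should not be removed.

I don't want it to split on abbreviations such as "Dr." (for example, "Dr. Jones").
Should I build a dictionary of all abbreviations?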


Given the input:

input = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"

Expected output:

output=['I think Dr. Jones is busy now.','Can you visit some other day?','I was really surprised!']

What I have tried:

# doing something like this:
output = input.split('.')
# will produce
'''
['I think Dr', ' Jones is busy now', ' Can you visit some other day? I was really surprised!']
'''

# whereas doing
output = input.split(' ')
# will produce
'''
['I', 'think', 'Dr.', 'Jones', 'is', 'busy', 'now.', 'Can', 'you', 'visit', 'some', 'other', 'day?', 'I', 'was', 'really', 'surprised!']
'''

The basic assumption is that the input text contains no unusual punctuation.


1 answer
User
#1 · Posted 2024-09-24 02:23:19

A clumsy way to achieve this is the following:

abbr = {'Dr.', 'Mr.', 'Mrs.', 'Ms.'}
sentence_ender = ['.', '?', '!']

s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"

def containsAny(wrd, charList):
    # Return True if the word contains at least one character from charList,
    # i.e. we are testing whether the word contains a sentence-ending character.
    return any(c in wrd for c in charList)

def separate_sentences(string):
    sentences = []    # will be a list of all complete sentences
    temp = []         # will be a list of all words in current sentence

    for wrd in string.split(' '):  # the input string is split on spaces
        temp.append(wrd)           # append current word to temp

        # If the word is not a known abbreviation but contains one of the
        # sentence delimiters, join the collected words into a
        # space-separated sentence and clear temp
        if wrd not in abbr and containsAny(wrd, sentence_ender):
            sentences.append(' '.join(temp))  # combine words currently in temp
            temp = []                         # clear temp, for next sentence
    if temp:  # keep trailing words that do not end with a delimiter
        sentences.append(' '.join(temp))
    return sentences


print(separate_sentences(s))

This should produce:

['I think Dr. Jones is busy now.', 'Can you visit some other day?', 'I was really surprised!']
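Note that the approach above ends a sentence at any word that contains a delimiter, so a token such as "3.14" would also trigger a split. As a possible variation (a minimal sketch, not a complete sentence tokenizer), the boundary test can be restricted to whitespace that directly follows ".", "?" or "!", using the standard `re` module. The names `ABBREVIATIONS` and `split_sentences` are made up for this illustration, and the abbreviation set is only a small sample that you would need to extend for real text.

```python
import re

# Hypothetical sample of abbreviations; extend this set for real input.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms."}


def split_sentences(text):
    """Split text at whitespace that follows '.', '?' or '!',
    unless the word before that whitespace is a known abbreviation."""
    sentences = []
    start = 0  # index where the current sentence begins
    # Each match is a run of whitespace directly preceded by a delimiter.
    for match in re.finditer(r"(?<=[.?!])\s+", text):
        candidate = text[start:match.start()]
        # The last whitespace-separated token decides if this is a real boundary.
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." -> keep scanning, do not split here
        sentences.append(candidate)
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


s = "I think Dr. Jones is busy now. Can you visit some other day? I was really surprised!"
print(split_sentences(s))
```

Like the answer above, this still relies on a hand-maintained abbreviation list, so it only works under the question's assumption that the input has no unusual punctuation.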
