Counting two consecutive words as a single term in a word-frequency count

import nltk
import pandas as pd

# df is the asker's DataFrame with a 'Sentence' column
low_case = df['Sentence'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(low_case)
word_dist = nltk.FreqDist(words)
example = pd.DataFrame(word_dist.most_common(1000), columns=['Word', 'Freq'])

1 answer

User

#1 · Posted 2024-09-24 00:22:22

You can do some preprocessing and use re match objects to separate the bigrams from the rest of the sentence. For example:

import re

# initialize sentence text
sentence_without_bigrams = 'Who the president of Kuala Lumpur or Other Place is?'
bigrams = []

# loop until there are no remaining bi-grams
while True:
    # find the next run of two or more capitalized words
    match = re.search(r'[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+', sentence_without_bigrams)
    if match is None:
        break
    else:
        # add bi-gram to list of bi-grams
        bigrams.append(match.group(0))
        # remove the bigram and the space before it; max() guards against a match at index 0
        start = max(match.start() - 1, 0)
        sentence_without_bigrams = sentence_without_bigrams[:start] + sentence_without_bigrams[match.end():]


print(bigrams)
>> ['Kuala Lumpur', 'Other Place']

print(sentence_without_bigrams)
>> Who the president of or is?
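To tie this back to the frequency count in the question, one option is to treat each extracted phrase as a single term and count it together with the remaining words. The following is only a minimal sketch under some assumptions: `count_terms` is a hypothetical helper name, `collections.Counter` stands in for `nltk.FreqDist`, and a simple regex tokenizer stands in for `nltk.tokenize.word_tokenize`.

```python
import re
from collections import Counter

# Same pattern as above: a run of two or more capitalized words.
PHRASE = re.compile(r'[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+')

def count_terms(sentences):
    """Count capitalized multi-word phrases as single terms, plus the remaining words."""
    counts = Counter()
    for sentence in sentences:
        # Each capitalized run is counted once, as one lowercased term.
        counts.update(p.lower() for p in PHRASE.findall(sentence))
        # Blank out the phrases, then count the leftover words individually.
        rest = PHRASE.sub(' ', sentence)
        counts.update(w.lower() for w in re.findall(r'[\w-]+', rest))
    return counts

counts = count_terms([
    'Who the president of Kuala Lumpur or Other Place is?',
    'Kuala Lumpur is a city.',
])
print(counts.most_common(3))
```

The result of `counts.most_common(1000)` can be passed to `pd.DataFrame(..., columns=['Word', 'Freq'])` in the same way as in the question.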

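However, this regex-based solution does not fully achieve your goal, because a sentence such as 'Hello, Mr President Obama' is not captured correctly: the pattern matches the whole run 'Mr President Obama' as one item instead of splitting it into two-word pairs (as shown here).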
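If the goal is simply to count every pair of adjacent words, a sliding window over the token list avoids the capitalization heuristic altogether. This is a rough sketch rather than a definitive answer: `bigram_counts` is a hypothetical helper name, and the regex tokenizer is an assumption standing in for `nltk.tokenize.word_tokenize`.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count every pair of adjacent words in the lowercased text."""
    words = re.findall(r"[\w'-]+", text.lower())
    # zip(words, words[1:]) pairs each word with the word that follows it.
    return Counter(' '.join(pair) for pair in zip(words, words[1:]))

counts = bigram_counts('Hello, Mr President Obama. Mr President Obama spoke.')
print(counts['mr president'], counts['president obama'])
```

Note that this version also pairs words across sentence boundaries (here 'obama mr'); splitting the text into sentences first would avoid that. With NLTK, `nltk.bigrams(words)` yields the same adjacent pairs as tuples, which can be fed to `nltk.FreqDist`.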
