Keyword extraction: plural/singular/past-tense/-ing forms of the same word

Posted 2024-06-28 20:23:29


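When extracting keywords from text, I noticed that most of what I get back is the same word in different forms. Is there a way to make each word appear only once?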

Example: updated updates update updating | research researched researchers | files filed file

Code (using the Summa (TextRank) package):

from summa import keywords

k_words = keywords.keywords(str(document), words=10, ratio=0.2, language='english')

1 answer
Answer 1 · posted 2024-06-28 20:23:29

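Before doing anything else with the text, you need to stem or lemmatize it (and also remove stop words and punctuation). NLTK has built-in lemmatizers and stemmers you can use: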

For stemming:

import nltk

from nltk.stem import PorterStemmer

porter = PorterStemmer()

print(porter.stem("cats"))  #  =>  cat
print(porter.stem("trouble"))  #  =>  troubl
print(porter.stem("troubling"))  #  =>  troubl
print(porter.stem("troubled"))  #  =>  troubl
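
To apply this to a keyword list like the one in the question, one option is to group the extracted keywords by stem and keep one word per group. The sketch below is only an illustration: `dedupe_keywords` and `dedupe_with_porter` are hypothetical helper names (not part of NLTK or Summa), and keeping the first word seen for each stem is an assumption about which form you want to keep.

```python
def dedupe_keywords(words, stem):
    """Keep the first word seen for each stem, preserving input order.

    `stem` is any callable that maps a word to its stem.
    """
    seen = set()
    unique = []
    for word in words:
        key = stem(word.lower())
        if key not in seen:
            seen.add(key)
            unique.append(word)
    return unique


def dedupe_with_porter(words):
    """Same as dedupe_keywords, using NLTK's Porter stemmer as the key."""
    # Imported here so dedupe_keywords stays usable without NLTK installed.
    from nltk.stem import PorterStemmer

    return dedupe_keywords(words, PorterStemmer().stem)
```

With Summa, passing `split=True` to `keywords.keywords(...)` returns a list instead of a newline-separated string, which can be passed straight to `dedupe_with_porter`. Multi-word keyphrases would need to be stemmed token by token, which this sketch does not do.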

From DataCamp:

"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."

For lemmatization:

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

wordnet_lemmatizer.lemmatize("has")  #  =>  has
wordnet_lemmatizer.lemmatize("was")  #  =>  wa
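
The odd `wa` result above comes from `lemmatize` treating every word as a noun unless told otherwise (its `pos` argument defaults to `"n"`); passing `pos="v"` looks the word up as a verb instead. A minimal sketch, assuming the WordNet data has already been fetched with `nltk.download("wordnet")`. `lemmatize_verb_first` and `make_wordnet_lemmatizer` are hypothetical helpers, and trying the verb reading before the noun reading is a rough heuristic, not real part-of-speech tagging.

```python
def lemmatize_verb_first(word, lemmatize):
    """Return the verb lemma if it differs from the word, else the noun lemma.

    `lemmatize` is any callable with the (word, pos) signature of
    WordNetLemmatizer.lemmatize.
    """
    as_verb = lemmatize(word, "v")
    if as_verb != word:
        return as_verb
    return lemmatize(word, "n")


def make_wordnet_lemmatizer():
    """Build NLTK's lemmatizer; needs nltk.download("wordnet") to have run once."""
    # Imported here so lemmatize_verb_first can be used with any lemmatizer.
    from nltk.stem import WordNetLemmatizer

    return WordNetLemmatizer().lemmatize
```

For example, `lemmatize_verb_first("was", make_wordnet_lemmatizer())` looks up `was` as a verb first, for which WordNet gives `be`.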

From DataCamp:

"Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words."

You can read more about stemming and lemmatization with Python NLTK in this article.
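
The answer also mentions removing stop words and punctuation before stemming or lemmatizing. A rough standard-library sketch of that step follows; `clean_tokens` is a hypothetical helper, the stop-word set is a tiny illustrative sample (NLTK ships a fuller list in `nltk.corpus.stopwords`), and the regex assumes plain English text, so it also drops digits and splits on apostrophes.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "was", "were"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop punctuation, and remove stop words."""
    # Keep only runs of ASCII letters; everything else acts as a separator.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]
```

The resulting tokens can then be stemmed or lemmatized one by one.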
