PorterStemmer() treats the last word in a sentence differently

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns = ['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    cleanQ = []
    for i in range(len(x)):
        cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())
    splitQ = []
    for row in cleanQ:
        splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)
    originQ = []
    for i in splitQ:
        originQ.append(PorterStemmer().stem(i))
    print(originQ)

rower(test.grams)

1 Answer

User

#1 · Posted on 2024-09-30 22:22:47

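The main mistake here is passing several words to the stemmer at once instead of one word at a time. Each whole string (the third one, for example) is treated as a single word, so only its final part gets stemmed.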

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                  'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}

ps = PorterStemmer()

def rower(x):
    # Replace punctuation and whitespace with spaces, then lowercase
    cleanQ = []
    for i in range(len(x)):
        cleanQ.append(re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', x[i]).lower())

    # Tokenize each sentence and drop stopwords
    splitQ = []
    for row in cleanQ:
        splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    # Stem one word at a time instead of the whole joined string
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)


rower(test.grams)

Output:

IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
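
To see the effect in isolation, here is a minimal sketch that stems the same sentence both ways. It assumes only that NLTK is installed; the names `sentence`, `whole`, and `per_word` are illustrative and do not come from the original code.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
sentence = "third donkey three"

# Passing the whole sentence: the stemmer sees one long "word",
# so suffix rules can only touch the tail of the string.
whole = ps.stem(sentence)

# Passing one token at a time: each word is stemmed on its own.
per_word = [ps.stem(token) for token in sentence.split()]

print(repr(whole))
print(per_word)
```

Comparing the two printed values shows why the last word looked different in the question: the per-word list matches the third row of the OUT line above, while the whole-string call applies suffix rules to the full string as if it were a single token.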

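There is a good explanation of why the stemmer drops the trailing "e" from some words in the question linked below. If the output is not what you expect, consider using a lemmatizer instead.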

How to stop NLTK stemmer from removing the trailing “e”?
