Removing stop words from a list

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]

filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)

3 answers

User

Answer 1 · edited 2024-10-05 14:22:04

Once filtered_words contains duplicate results, you can use itertools to remove them:

import itertools
# groupby only collapses adjacent equal items, so sort first
filtered_words.sort()
list(k for k, _ in itertools.groupby(filtered_words))

The result is:

[['sentence', 'another', 'list'], ['sentence', 'list']]

I followed this Stack Overflow question: Remove duplicates from a list of list
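One side effect of the sort-then-groupby approach is that the sentences come back in sorted order rather than in the order they were given. If the original order matters, a possible alternative is to track what has already been seen. This is only a sketch: `dedupe_keep_order` is a hypothetical helper name, and the sample `filtered_words` is written out by hand to mimic the duplicated output from the question.

```python
def dedupe_keep_order(lists):
    """Drop repeated inner lists while keeping first-seen order."""
    seen = set()
    unique = []
    for item in lists:
        key = tuple(item)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

# Hand-written stand-in for the duplicated result described in the question
filtered_words = [
    ['sentence', 'list'], ['sentence', 'list'],
    ['sentence', 'another', 'list'], ['sentence', 'another', 'list'],
    ['sentence', 'another', 'list'],
]

result = dedupe_keep_order(filtered_words)
print(result)  # [['sentence', 'list'], ['sentence', 'another', 'list']]
```

Unlike the groupby version, this does not need the input to be sorted, so sent1 stays ahead of sent2.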

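User

Answer 2 · edited 2024-10-05 14:22:04

What you are doing wrong is appending lst to filtered_words every time you find a non-stop word. That is why you get 2 copies of the filtered sent1 (it contains 2 non-stop words) and 3 copies of the filtered sent2 (it contains 3 non-stop words). Append once, after you have finished checking each sentence: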

for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)

By the way, the statement

" ".join(lst)

does nothing useful, because you are computing something (a string) but not storing it anywhere.
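If the intent was to get each filtered sentence back as a string, the result of join has to be kept somewhere. A minimal sketch of that, under two assumptions that are not in the original: a small hand-written stop-word set stands in for `stopwords.words('english')` so it runs without the NLTK corpus, and `filtered_sentences` is a made-up name.

```python
# Assumption: a tiny hand-picked set replaces stopwords.words('english')
# so this sketch runs without downloading NLTK data.
stop_words = {'i', 'have', 'a', 'which', 'is'}

sent_lower = ['i have a sentence which is a list',
              'i have a sentence which is another list']

filtered_sentences = []
for s in sent_lower:
    kept = [w for w in s.split() if w not in stop_words]
    # Store the joined string instead of discarding it
    filtered_sentences.append(" ".join(kept))

print(filtered_sentences)  # ['sentence list', 'sentence another list']
```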

Edit

A more Pythonic way, using a list comprehension:

for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)

User

Answer 3 · edited 2024-10-05 14:22:04

This will give you the result you want:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

sent1 = sent1.lower().split()
sent2 = sent2.lower().split()

l = [sent1, sent2]

for n, sent in enumerate(l):
    for stop_word in stop_words:
        sent = [word for word in sent if word != stop_word]
    l[n] = sent

print(l)
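The loop above rebuilds each sentence once per stop word. A single pass with a set lookup should give the same result with less work. This is a sketch rather than a drop-in replacement: the short `stop_words` list is an assumption standing in for `stopwords.words('english')`, and `stop_set` and `filtered` are illustrative names.

```python
# Assumption: a short list stands in for stopwords.words('english')
stop_words = ['i', 'have', 'a', 'which', 'is']
stop_set = set(stop_words)  # set membership tests are O(1) on average

sentences = ['I have a sentence which is a list',
             'I have a sentence which is another list']

# One pass per sentence: lowercase, split, keep words not in the stop set
filtered = [[w for w in s.lower().split() if w not in stop_set]
            for s in sentences]

print(filtered)  # [['sentence', 'list'], ['sentence', 'another', 'list']]
```

With the real NLTK list, converting it to a set once up front is what matters, since `word in some_list` scans the whole list on every check.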
