Removing stop words from a list

Posted 2024-10-05 14:22:04


I want to remove the stop words from a list while keeping the format unchanged (that is, the result should still be a list).

Here is the code I have tried so far:

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]

filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)

Current output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list']]

Desired output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'another', 'list']]

I am getting duplicate copies of the lists. What am I doing wrong in the loop? And is there a better way to do this than writing so many for loops?


3 answers

Once filtered_words already contains the duplicated results, you can remove the duplicates with itertools:

import itertools
filtered_words.sort()
list(filtered_words for filtered_words,_ in itertools.groupby(filtered_words))

The result is (note that sort() changes the order of the sublists):

[['sentence', 'another', 'list'], ['sentence', 'list']]

I followed this StackOverflow link: Remove duplicates from a list of list
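If the original sentence order matters, sorting is not ideal. As a rough alternative sketch (not from the linked answer), the duplicates can be dropped while keeping the first occurrence of each sublist. The helper name dedupe_preserving_order is made up for this illustration.

```python
def dedupe_preserving_order(list_of_lists):
    """Drop repeated inner lists, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for inner in list_of_lists:
        key = tuple(inner)  # lists are unhashable, so a tuple serves as the set key
        if key not in seen:
            seen.add(key)
            unique.append(inner)
    return unique


# The duplicated output from the question
filtered_words = [
    ['sentence', 'list'],
    ['sentence', 'list'],
    ['sentence', 'another', 'list'],
    ['sentence', 'another', 'list'],
    ['sentence', 'another', 'list'],
]
deduped = dedupe_preserving_order(filtered_words)
print(deduped)
# [['sentence', 'list'], ['sentence', 'another', 'list']]
```

This only cleans up after the fact. Fixing the loop so the duplicates never appear is the simpler route.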

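What you are doing wrong is appending lst to filtered_words every time you find a non-stop word. That is why you get 2 copies of the filtered sent1 (it contains 2 non-stop words) and 3 copies of the filtered sent2 (it contains 3 non-stop words). Append only once, after each sentence has been fully checked: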

for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)
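The duplicates in the question all show the complete word list rather than partial ones, because every append stores a reference to the same list object. The sketch below illustrates this on the first sentence. The small stop_words set is a hypothetical stand-in for stopwords.words('english'), used here only so the example runs without the NLTK data.

```python
# Hypothetical stand-in for stopwords.words('english') (assumption for this sketch)
stop_words = {'i', 'have', 'a', 'which', 'is'}

words = 'i have a sentence which is a list'.split()

# Buggy version: append inside the if, once per kept word
buggy = []
lst = []
for w in words:
    if w not in stop_words:
        lst.append(w)
        buggy.append(lst)

print(buggy)                 # [['sentence', 'list'], ['sentence', 'list']]
print(buggy[0] is buggy[1])  # True: both entries are the same list object

# Fixed version: append once, after the whole sentence has been scanned
fixed = []
lst = []
for w in words:
    if w not in stop_words:
        lst.append(w)
fixed.append(lst)

print(fixed)                 # [['sentence', 'list']]
```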

By the way, the statement

" ".join(lst)

has no effect, because it computes something (a string) but does not store it anywhere.

Edit

A more Pythonic way, using a list comprehension:

for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)
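The remaining outer loop can also be folded into a nested list comprehension. Here is a minimal sketch, assuming a small hard-coded stop_words set in place of the NLTK list so it runs on its own. With NLTK you would presumably build it as set(stopwords.words('english')), since membership tests on a set are faster than on a list.

```python
# Hypothetical subset of English stop words (assumption for this sketch);
# with NLTK this could be set(stopwords.words('english'))
stop_words = {'i', 'have', 'a', 'which', 'is'}

sentences = [
    'I have a sentence which is a list',
    'I have a sentence which is another list',
]

# One inner list per sentence: lowercase, split on whitespace, drop stop words
filtered_words = [
    [word for word in sentence.lower().split() if word not in stop_words]
    for sentence in sentences
]
print(filtered_words)
# [['sentence', 'list'], ['sentence', 'another', 'list']]
```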

This will give you the desired result:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

sent1 = sent1.lower().split()
sent2 = sent2.lower().split()

l = [sent1, sent2]

for n, sent in enumerate(l):
    for stop_word in stop_words:
        sent = [word for word in sent if word != stop_word]
    l[n] = sent

print(l)
