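Select words only if they are not in stopwords and not in string.punctuation, for each row of my column
This is my data after tokenizing and removing stopwords. I would also like to remove punctuation at the same time as the stopwords. See row 2, where there is a comma after usf. I thought of using if word not in (stopwords,string.punctuation), expecting it to behave like not in stopwords and not in string.punctuation. I saw this approach here, but with it neither stopwords nor punctuation get removed. How can I fix this?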
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label','text']
stopwords = set(stopwords.words('english'))
def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data
data['text'] = data['text'].apply(process)
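A note on why the condition in process filters nothing: word not in (stopwords,string.punctuation) tests membership in a two-element tuple, so the word is compared against the whole set and the whole punctuation string rather than against their contents. The minimal sketch below illustrates this; stop_words is a hypothetical three-word stand-in for set(stopwords.words('english')), used only so the snippet runs without NLTK data.

```python
import string

# Assumption: a tiny stand-in for set(stopwords.words('english')).
stop_words = {"the", "a", "is"}

# The tuple has exactly two elements: the set object and the punctuation string.
pair = (stop_words, string.punctuation)

# Tuple membership compares each token with the two elements themselves,
# so neither a stopword nor a punctuation mark is ever found in it.
stopword_found_in_tuple = "the" in pair
comma_found_in_tuple = "," in pair

# Testing against each container separately gives the intended result.
stopword_found_in_set = "the" in stop_words
comma_found_in_punctuation = "," in string.punctuation

print(stopword_found_in_tuple, comma_found_in_tuple)
print(stopword_found_in_set, comma_found_in_punctuation)
```

Because both tuple checks come out False, not in is True for every token and the list comprehension keeps everything.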
In the function process, you must convert the type(str) value to pandas.core.series.Series and use concat.
The function will be:
def process(df):
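If you still want to do this in a single if statement, you can convert string.punctuation to a set and combine it with stopwords using the OR (|) operation, which yields the union of the two sets. You then need to change the membership test in the list comprehension so that it checks each word against that combined set instead of the tuple (stopwords,string.punctuation).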
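The single-if approach described above might look like the following sketch. The names stop_words, excluded and remove_stopwords_and_punctuation are illustrative and do not come from the original answer, and stop_words is a small hard-coded stand-in for set(stopwords.words('english')) so the example runs without NLTK data.

```python
import string

# Assumption: a tiny stand-in for set(stopwords.words('english')).
stop_words = {"i", "he", "around", "here", "though", "do", "not", "so"}

# Set union (the | operator) merges the stopwords with the individual
# punctuation characters into one lookup set.
excluded = stop_words | set(string.punctuation)

def remove_stopwords_and_punctuation(tokens):
    """Keep tokens that are neither a stopword nor a single punctuation character."""
    return [word for word in tokens if word not in excluded]

# Tokens taken from row 2 of the sample output in the question.
tokens = ["nah", "n't", "think", "goes", "usf", ",", "lives", "around", "here", "though"]
cleaned = remove_stopwords_and_punctuation(tokens)
print(cleaned)
```

In the question's code, the equivalent change would be to build the combined set once, outside process, and test word not in that set inside the list comprehension. Note that set(string.punctuation) only contains single characters, so multi-character tokens such as ... would still pass through this filter.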