<p>For each row of my column, I want to keep a word <strong>only if it is in neither the stopwords nor <code>string.punctuation</code></strong>.</p>
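<p>This is my data after tokenizing and removing stopwords. I also want to remove punctuation in the same pass that removes the stopwords. Note the comma after <code>usf</code> in row 2. I tried <code>if word not in (stopwords,string.punctuation)</code>, expecting it to behave like <code>not in stopwords and not in string.punctuation</code>, based on what I saw <a href="https://stackoverflow.com/questions/1075652/using-the-and-and-not-operator-in-python">here</a>, but with that condition neither the stopwords nor the punctuation are removed. How can I fix this?</p>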
<pre><code>data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
</code></pre>
<pre><code>import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label','text']
stopwords = set(stopwords.words('english'))
def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data
data['text'] = data['text'].apply(process)
</code></pre>
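<p>A minimal sketch of one possible fix, under stated assumptions. <code>(stopwords, string.punctuation)</code> is a two-element tuple, so <code>word not in (...)</code> compares each token by equality against the whole set and the whole punctuation string. In practice no token equals either, so nothing is filtered. Merging both collections into one flat set gives the intended test. <code>STOPWORDS</code> below is a small hypothetical stand-in for <code>set(stopwords.words('english'))</code>, and the token list is made up so the snippet runs without NLTK; <code>EXCLUDED</code> and <code>remove_stopwords_and_punctuation</code> are illustrative names, not part of the original code.</p>

```python
import string

# Hypothetical stand-in for set(stopwords.words('english')); a tiny subset
# so this sketch runs without downloading the NLTK corpus.
STOPWORDS = {"i", "do", "he", "to", "though", "here"}

# One flat set holding every token to drop: stopwords plus each single
# punctuation character from string.punctuation.
EXCLUDED = STOPWORDS | set(string.punctuation)


def remove_stopwords_and_punctuation(tokens):
    """Keep only tokens that are neither stopwords nor punctuation marks."""
    return [tok for tok in tokens if tok not in EXCLUDED]


# Made-up, already tokenized and lowercased example row.
tokens = ["nah", "i", "do", "n't", "think", "he", "goes", "to", "usf", ",",
          "he", "lives", "around", "here", "though"]

# Why the original condition filters nothing: "," is compared against the
# two tuple elements (a set and a str) as wholes, and equals neither.
original_condition = "," not in (STOPWORDS, string.punctuation)

cleaned = remove_stopwords_and_punctuation(tokens)
print(original_condition)
print(cleaned)
```

<p>In the question's <code>process</code> function, the same idea could be written as <code>word not in stopwords and word not in string.punctuation</code>, or as a membership test against <code>stopwords | set(string.punctuation)</code>. The two differ slightly: <code>in</code> on a string is a substring test, while the set version matches single punctuation characters exactly.</p>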