Text analysis: finding the most common words in a column using Python

Posted 2024-09-30 14:32:28


I created a DataFrame that has only one column, containing subject lines.

df = activities.filter(['Subject'],axis=1)
df.shape

This returns the following DataFrame:

    Subject
0   Call Out: Quadria Capital - May Lo, VP
1   Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2   Columbia Partners: WW Worked (Not Sure Will Ev...
3   Meeting, Sophie, CFO, CDC Investment
4   Prospecting

I then tried to analyze the text with the following code:

import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)

stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords) 

rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)

The error message I get is: 'Series' object has no attribute 'Subject'


2 Answers

Data:

Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting

# read in the data
df = pd.read_clipboard(sep=',')


Updated code:

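  • Convert all words to lowercase and replace every non-alphanumeric character with a space
    • txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series; this line is replaced in the code below
  • words = nltk.tokenize.word_tokenize(txt) throws a TypeError, because txt is a Series rather than a string
    • The code below tokenizes each row of the DataFrame instead
  • Tokenizing splits each string into a list of words. In this example, inspecting df shows a tok column in which every row is a list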
import nltk
import pandas as pd

top_N = 50

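# replace all non-alphanumeric characters with a space
# (regex=True is needed in pandas >= 2.0, where str.replace treats the pattern literally by default)
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)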

# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)


  • To analyze all the words in the column, merge the per-row lists into a single list named words.
# all tokenized words to a list
words = df.tok.tolist()  # this is a list of lists
words = [word for list_ in words for word in list_]

# frequency distribution
word_dist = nltk.FreqDist(words)

# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)

# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])

Output of rslt:

        Word  Frequency
        call          2
         out          2
     quadria          1
     capital          1
         may          1
          lo          1
          vp          1
  revelstoke          1
     anthony          1
       hayes          1
          sr          1
       assoc          1
    columbia          1
    partners          1
          ww          1
      worked          1
         not          1
        sure          1
        will          1
          ev          1
     meeting          1
      sophie          1
         cfo          1
         cdc          1
  investment          1
 prospecting          1
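
For comparison, the same count can be sketched with pandas alone, which avoids downloading the NLTK tokenizer and stopword data. This is an alternative to the answer's approach, not a restatement of it: the stop set below is a hypothetical, deliberately tiny stand-in for nltk.corpus.stopwords.words('english'), and the sample rows are copied from the question with their truncated strings left as they are.

```python
import pandas as pd

# Sample rows copied from the question (truncated strings kept as-is)
df = pd.DataFrame({"Subject": [
    "Call Out: Quadria Capital - May Lo, VP",
    "Call Out: Revelstoke - Anthony Hayes (Sr Assoc...",
    "Columbia Partners: WW Worked (Not Sure Will Ev...",
    "Meeting, Sophie, CFO, CDC Investment",
    "Prospecting",
]})

# Hypothetical, tiny stopword set used only for this illustration
stop = {"out", "not", "will"}

# Lowercase, pull out runs of word characters, then flatten to one token per row
tokens = df["Subject"].str.lower().str.findall(r"\w+").explode()

# Drop stopwords and count what is left
counts = tokens[~tokens.isin(stop)].value_counts()

print(counts.head(5))
```

Unlike rslt above, which is built from word_dist and therefore still contains stopwords such as "out" and "not", this count is taken after the stopwords are removed.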

The error is raised because you converted df into a Series on this line:

df = activities.filter(['Subject'],axis=1)

So when you write:

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

df is a Series, and a Series has no Subject attribute. Try replacing the line with:

txt = df.str.lower().str.replace(r'\|', ' ')

Alternatively, if you do not filter the DataFrame down to a single Series beforehand, then

txt = df.Subject.str.lower().str.replace(r'\|', ' ')

should work.

[Update]

What I said above is incorrect. As was pointed out, filter does not return a Series; it returns a DataFrame with a single column.
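
The correction can be checked with a short sketch. The activities frame here is a hypothetical stand-in for the asker's data, and the Owner column is invented for illustration. The sketch only shows what filter returns; it does not establish what actually caused the error in the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `activities` frame
activities = pd.DataFrame({
    "Subject": ["Prospecting", "Meeting, Sophie, CFO, CDC Investment"],
    "Owner": ["a", "b"],  # invented extra column, for illustration only
})

# Same call as in the question
df = activities.filter(["Subject"], axis=1)

is_frame = isinstance(df, pd.DataFrame)   # filter keeps the DataFrame type
col = df.Subject                          # attribute access yields the column
is_series = isinstance(col, pd.Series)

# The .str accessor therefore works on the selected column
txt = col.str.lower()

print(type(df).__name__, type(col).__name__)
```

Since df is still a DataFrame after filter, df.Subject is valid at that point in the code.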
