Pandas tokenization: "unhashable type: 'list'"

Posted 2024-09-30 02:24:56


I am following this example: Twitter data mining with Python and Gephi: Case synthetic biology

CSV to: df['Country', 'Responses']

'Country'
Italy
Italy
France
Germany

'Responses' 
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
"Loren ipsum..."
  1. Tokenize the text in 'Responses'
  2. Remove the 100 most common words (based on the Brown corpus)
  3. Find the 100 most frequent words among those that remain

I can complete steps 1 and 2, but step 3 raises an error:

TypeError: unhashable type: 'list'

I believe this is because I am working in a DataFrame and made this (possibly incorrect) modification:

Original example:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)

My code:

#divide to words
tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

My complete code:

# imports required by the code below
import pandas as pd
import nltk
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import brown

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

words =  df['tokenized_sents']

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

TypeError: unhashable type: 'list'

There are many questions about unhashable lists, but as far as I can tell none of them is quite the same. Any suggestions? Thanks.


Traceback:

TypeError                                 Traceback (most recent call last)
<ipython-input-164-a0d17b850b10> in <module>()
  1 #keep only most common words
----> 2 fdist = FreqDist(words)
  3 mostcommon = fdist.most_common(100)
  4 mclist = []
  5 for i in range(len(mostcommon)):

/home/*******/anaconda3/envs/*******/lib/python3.5/site-packages/nltk/probability.py in __init__(self, samples)
    104         :type samples: Sequence
    105         """
--> 106         Counter.__init__(self, samples)
    107 
    108     def N(self):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in __init__(*args, **kwds)
    521             raise TypeError('expected at most 1 arguments, got %d' % len(args))
    522         super(Counter, self).__init__()
--> 523         self.update(*args, **kwds)
    524 
    525     def __missing__(self, key):

/home/******/anaconda3/envs/******/lib/python3.5/collections/__init__.py in update(*args, **kwds)
    608                     super(Counter, self).update(iterable) # fast path when counter is empty
    609             else:
--> 610                 _count_elements(self, iterable)
    611         if kwds:
    612             self.update(kwds)

TypeError: unhashable type: 'list'

1 Answer
User
#1 · Posted 2024-09-30 02:24:56

The FreqDist function takes an iterable of hashable objects (it is intended for strings, but it will probably work with anything hashable). The error you are getting is because you pass in an iterable of lists. As you suspected, it is caused by the change you made:

df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

If I understand the Pandas apply function documentation correctly, that line applies the nltk.word_tokenize function to a Series. word_tokenize returns a list of words, so each element of df['tokenized_sents'] is itself a list.
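To make the failure concrete, here is a minimal sketch (the two-row DataFrame is invented for illustration and is not the question's data; it assumes the NLTK 'punkt' tokenizer data has been downloaded):

import pandas as pd
import nltk
from nltk import FreqDist

# A tiny stand-in for the question's 'Responses' column.
df = pd.DataFrame({'Responses': ["Lorem ipsum dolor", "Lorem ipsum"]})
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

print(df['tokenized_sents'].iloc[0])  # ['Lorem', 'ipsum', 'dolor'] -- each row is a list

# FreqDist(df['tokenized_sents']) raises TypeError: unhashable type: 'list',
# because counting the elements requires hashing them, and lists are not hashable.
flat = [w for row in df['tokenized_sents'] for w in row]  # flatten to one list of strings
print(FreqDist(flat).most_common(2))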

As a solution, just join the lists together before applying FreqDist, like so:

words = []
for wordList in df['tokenized_sents']:
    words += wordList

fdist = FreqDist(words)

Here is a more complete revision that does what you want. If all you need is to identify the second set of 100 words, note that mclist will hold them the second time around:

df = pd.read_csv('CountryResponses.csv', encoding='utf-8', skiprows=0, error_bad_lines=False)

tokenizer = RegexpTokenizer(r'\w+')
df['tokenized_sents'] = df['Responses'].apply(nltk.word_tokenize)

lists =  df['tokenized_sents']
words = []
for wordList in lists:
    words += wordList

#remove 100 most common words based on Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
words = [w for w in words if w not in mclist]

Out: ['the',
 ',',
 '.',
 'of',
 'and',
...]

#keep only most common words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = []
for i in range(len(mostcommon)):
    mclist.append(mostcommon[i][0])
# mclist contains second-most common set of 100 words
words = [w for w in words if w in mclist]
# this will keep ALL occurrences of the words in mclist
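As a side note beyond the answer above, the flattening loop can also be written with itertools.chain, and turning mclist into a set speeds up the membership tests in the list comprehensions; a sketch building on the df defined above (assuming the Brown corpus data is installed):

from itertools import chain
from nltk import FreqDist
from nltk.corpus import brown

# Flatten the Series of token lists into one flat list of strings.
words = list(chain.from_iterable(df['tokenized_sents']))

# A set gives O(1) membership tests in the comprehension below.
common_brown = {word for word, _ in FreqDist(brown.words()).most_common(100)}
words = [w for w in words if w not in common_brown]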
