How do I create a word cloud showing the most common bigrams in a text using Python?

Posted 2024-09-27 00:12:40


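I am trying to analyze Twitter data using textblob. The most common bigrams in my Twitter text, together with their frequencies, are retrieved and stored in the list variable "l", as shown below.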

import collections

from nltk.util import ngrams
from textblob import TextBlob

# `text` holds the raw Twitter text
blob = TextBlob(text)

# First, get the individual words
tokenized = blob.split()

# Then get all the bigrams (ngrams returns a generator)
Bigrams = ngrams(tokenized, 2)

# Get the frequency of each bigram
BigramFreq = collections.Counter(Bigrams)

# What are the ten most popular bigrams?
l = BigramFreq.most_common(10)
l

Here, the output of "l" is a list containing each bigram and its frequency. Running the code above displays the following:

[(('@UniverCurious:', 'The'), 39),
 (('The', 'underside'), 38),
 (('underside', 'of'), 38),
 (('of', 'Jupiter.'), 38),
 (('Jupiter.', 'Credit:'), 38),
 (('Credit:', 'NASA/JPL/JUNO'), 38),
 (('to', 'the'), 25),
 (('just', '100'), 15),
 (('20', 'years'), 14)]
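For readers without nltk installed, the same counting step can be sketched with the standard library alone. This is a minimal illustration, not the asker's code: the helper name `count_bigrams` and the sample sentence are assumptions made for the example, and tokenization is plain whitespace splitting, as in the question.

```python
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs in whitespace-tokenized text."""
    tokens = text.split()
    # zip(tokens, tokens[1:]) pairs every token with its successor
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical sample text, used only for illustration
sample = "to the moon and to the stars"
top = count_bigrams(sample).most_common(1)
print(top)  # [(('to', 'the'), 2)]
```

As with `ngrams`, each bigram comes back as a tuple of two strings, which matters when the counts are later handed to another library.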

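I can now create a table from the most common bigrams, but I need help creating a word cloud from the code given above.

My question is: how do I create a word cloud from the list "l"?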


2 Answers

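Thank you! Creating the table now works correctly. I have since extended the code to create a word cloud, but it raises the error "TypeError: expected string". My extended code is shown below: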

# Convert the list of bigrams `l` into a dictionary `d`.
# The dictionary is to be used for creating the word cloud.
d = {}
for ngram_list, cnt_list in l:
    d[ngram_list] = cnt_list

d

from wordcloud import WordCloud

# Generate a word cloud from a dictionary of frequencies
wordcloud = WordCloud(colormap='prism').generate_from_frequencies(d)
wordcloud.to_image()

The error is shown below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-9de1b9f89116> in <module>
     17 
     18 # generate a word cloud from a dictionary of frequencies
---> 19 wordcloud = WordCloud(colormap = 'prism').generate_from_frequencies(d)
     20 wordcloud.to_image()

~\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    432             else:
    433                 self.generate_from_frequencies(dict(frequencies[:2]),
--> 434                                                max_font_size=self.height)
    435                 # find font sizes
    436                 sizes = [x[1] for x in self.layout_]

~\Anaconda3\lib\site-packages\wordcloud\wordcloud.py in generate_from_frequencies(self, frequencies, max_font_size)
    486                     font, orientation=orientation)
    487                 # get size of resulting text
--> 488                 box_size = draw.textsize(word, font=transposed_font)
    489                 # find possible places using integral image:
    490                 result = occupancy.sample_position(box_size[1] + self.margin,

~\Anaconda3\lib\site-packages\PIL\ImageDraw.py in textsize(self, text, font, spacing, direction, features, language)
    337         if font is None:
    338             font = self.getfont()
--> 339         return font.getsize(text, direction, features, language)
    340 
    341     def multiline_textsize(

~\Anaconda3\lib\site-packages\PIL\ImageFont.py in getsize(self, text, *args, **kwargs)
    489 
    490     def getsize(self, text, *args, **kwargs):
--> 491         w, h = self.font.getsize(text)
    492         if self.orientation in (Image.ROTATE_90, Image.ROTATE_270):
    493             return h, w

~\Anaconda3\lib\site-packages\PIL\ImageFont.py in getsize(self, text, direction, features, language)
    221         :return: (width, height)
    222         """
--> 223         size, offset = self.font.getsize(text, direction, features, language)
    224         return (size[0] + offset[0], size[1] + offset[1])
    225 

TypeError: expected string

Please let me know if I am doing something wrong here.
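The traceback ends inside PIL's text measurement, which is handed each dictionary key as the text to draw. In the code above those keys are tuples such as ('The', 'underside'), so the most likely cause of "expected string" is that the keys are not strings. A minimal sketch of one possible fix is to join each bigram into a single string before building the dictionary; the helper name `stringify_bigram_keys` is an assumption made for this example, and the list is a shortened copy of the one in the question.

```python
def stringify_bigram_keys(pairs):
    """Turn [((w1, w2), count), ...] into {"w1 w2": count, ...}."""
    return {" ".join(bigram): count for bigram, count in pairs}


# Shortened copy of the list from the question, for illustration
l = [(("The", "underside"), 38), (("to", "the"), 25)]

d = stringify_bigram_keys(l)
print(d)  # {'The underside': 38, 'to the': 25}
```

The resulting dictionary can then be passed to `WordCloud(colormap='prism').generate_from_frequencies(d)` in place of the tuple-keyed one.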

In [1]: import pandas as pd

In [2]: a =  [(('@UniverCurious:', 'The'), 39),
   ...:  (('The', 'underside'), 38),
   ...:  (('underside', 'of'), 38),
   ...:  (('of', 'Jupiter.'), 38),
   ...:  (('Jupiter.', 'Credit:'), 38),
   ...:  (('Credit:', 'NASA/JPL/JUNO'), 38),
   ...:  (('to', 'the'), 25),
   ...:  (('just', '100'), 15),
   ...:  (('20', 'years'), 14)]

In [3]: ngram_list = [" ".join(p[0]) for p in a]

In [4]: cnt_list = [p[1] for p in a]

In [5]: df = pd.DataFrame(list(zip(ngram_list, cnt_list)), columns=['bigram', 'cnt'])

In [6]: df
Out[6]:
                  bigram  cnt
0    @UniverCurious: The   39
1          The underside   38
2           underside of   38
3            of Jupiter.   38
4       Jupiter. Credit:   38
5  Credit: NASA/JPL/JUNO   38
6                 to the   25
7               just 100   15
8               20 years   14

How about this? For the word cloud itself, you will probably need a separate module, such as wordcloud. For an example, see this link.
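Continuing from the `ngram_list` and `cnt_list` built in the session above, one way to get from the table to an image might look like the sketch below. It assumes the wordcloud package is installed; the helper names, the image size, the background color, and the output file name "bigrams.png" are illustrative choices, not part of the original answer.

```python
def frequencies_from_lists(ngrams, counts):
    """Pair each bigram string with its count: {"w1 w2": count, ...}."""
    return dict(zip(ngrams, counts))


def render_wordcloud(frequencies, path):
    """Draw the frequencies as a word cloud and save it to `path`."""
    # Imported lazily so the helper above works without wordcloud installed
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)
    return cloud


# Shortened version of the lists from the session above
ngram_list = ["The underside", "to the", "20 years"]
cnt_list = [38, 25, 14]

freqs = frequencies_from_lists(ngram_list, cnt_list)
# Uncomment to write the image (requires the wordcloud package):
# render_wordcloud(freqs, "bigrams.png")
```

Because the keys are already joined strings, each bigram is drawn as a single phrase sized by its count.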
