How to count bigrams in Python using a loop

Published 2024-06-25 05:47:08


I have a specific coding problem in Python.

from collections import defaultdict

Count = defaultdict(int)
for l in text:
    # count every whitespace-separated token in the review
    for m in l['reviews'].split():
        Count[m] += 1

print(Count)

text is a list of dictionaries, each with a 'reviews' key that holds the review string.

If I run this code, I get a result like this:

defaultdict(<type 'int'>, {'superficial,': 2, 'awesome': 1, 
'interesting': 3, 'A92': 2, ....

What I want is bigram counts rather than unigram counts. I tried the following code:

Count = defaultdict(int)
for l in text:
    for m in l['reviews'].split():
       Count[m, m+1] += 1

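I would like to use code similar to this rather than the other code that already exists on Stack Overflow. Most existing code works from a word list, but I want to count bigrams directly from split(), applied to the raw text.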

I would like to get a result like this:

defaultdict(<type 'int'>, {('superficial', 'awesome'): 1, ('awesome, interesting'): 1, 
('interesting','A92'): 2, ....}

Why does this raise an error, and how can I fix this code?


3 Answers

Do you want to count how often each pair of adjacent words occurs? Make the pairs tuples.

text = [{'ideology': 3.4, 'ID': '50555', 'reviews': 'Politician from CA-21, very liberal and aggressive'}]
Count = {}
for l in text:
    words = l['reviews'].split()
    # pair each word with the one that follows it
    for i in range(len(words) - 1):
        if (words[i], words[i+1]) not in Count:
            Count[(words[i], words[i+1])] = 0
        Count[(words[i], words[i+1])] += 1

print(Count)

Result:

{('and', 'aggressive'): 1, ('from', 'CA-21,'): 1, ('Politician', 'from'): 1, ('CA-21,', 'very'): 1, ('very', 'liberal'): 1, ('liberal', 'and'): 1}
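As for why the attempt in the question fails: m is a string, so m+1 tries to add an integer to a string and raises a TypeError, and the loop never sees the following word anyway. A minimal sketch that stays close to the question's defaultdict loop pairs each word with its successor using zip. The sample text is the one from this answer and is only an assumption about the shape of the real data.

```python
from collections import defaultdict

# Sample data in the shape used in this answer (an assumption about the real input).
text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'}]

Count = defaultdict(int)
for l in text:
    words = l['reviews'].split()
    # zip(words, words[1:]) yields (word, next_word) pairs
    for first, second in zip(words, words[1:]):
        Count[first, second] += 1

print(dict(Count))
```

Writing Count[first, second] is the same as Count[(first, second)], so the keys are tuples, as in the desired output.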

If I understand your question correctly, the code below will solve your problem.

Count = dict()
for l in text:
    words = l['reviews'].split()
    for i in range(0, len(words) - 1):
        # join two adjacent words into one string key
        bigram = " ".join(words[i:i+2])
        if bigram not in Count:
            Count[bigram] = 1
        else:
            Count[bigram] = Count[bigram] + 1

Count then maps each bigram, stored as a space-joined string, to the number of times it occurs.

Edit: if you want the keys to be tuples, just change the join line. Python dicts can hash tuples too.
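A sketch of the tuple-key variant that the edit describes. Only the line that builds bigram changes; the one-entry text list is an assumed sample input, and dict.get stands in for the explicit membership test.

```python
# Sample input (assumed): one entry with a 'reviews' string.
text = [{'reviews': 'Politician from CA-21, very liberal and aggressive'}]

Count = dict()
for l in text:
    words = l['reviews'].split()
    for i in range(len(words) - 1):
        # tuple key instead of " ".join(...)
        bigram = tuple(words[i:i+2])
        # get() returns 0 the first time a bigram is seen
        Count[bigram] = Count.get(bigram, 0) + 1

print(Count)
```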

The standard library has a tool for counting objects, called Counter. With the help of itertools as well, a bigram counter script can look like this:

from collections import Counter
from itertools import tee

# function from the 'recipes' section of the itertools documentation page
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    # built-in zip; on Python 2, itertools.izip is the lazy equivalent
    return zip(a, b)

text = [{'ideology': 3.4, 'ID': '50555',
         'reviews': 'Politician from CA-21, very liberal and aggressive'},
        {'ideology': 1.5, 'ID': '10223',
         'reviews': 'Retired politician'}]

c = Counter()
for l in text:
    c.update(pairwise(l['reviews'].split()))

print(c.items())
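On Python 3.10 and later, itertools.pairwise is available directly, so the recipe does not have to be copied. A minimal sketch, assuming Python 3 (with a fallback to the recipe on versions before 3.10), that also uses Counter.most_common to pick out the most frequent bigram. The third entry in text is hypothetical and is added only so that one bigram repeats.

```python
from collections import Counter

try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        # fallback: the itertools recipe
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

text = [
    {'ideology': 3.4, 'ID': '50555',
     'reviews': 'Politician from CA-21, very liberal and aggressive'},
    {'ideology': 1.5, 'ID': '10223', 'reviews': 'Retired politician'},
    # hypothetical extra entry, added only so that one bigram repeats
    {'ideology': 2.0, 'ID': '00000', 'reviews': 'very liberal voter'},
]

c = Counter()
for l in text:
    c.update(pairwise(l['reviews'].split()))

# the single most frequent bigram with its count
print(c.most_common(1))
```

Counter.update adds one count for every pair the iterator yields, so no explicit inner loop is needed.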
