Printing unigram counts with Python

Posted 2024-06-17 06:26:45


I have a file named corpus.txt that contains the following 4 lines of text:

 peter piper picked a peck of pickled peppers 
 a peck of pickled peppers peter piper picked 
 if peter piper picked a peck of pickled peppers 
 where s the peck of pickled peppers peter piper picked 

I want the program to print each word together with the number of times it occurs, for example:

^{pr2}$

and so on.

This is the code I wrote:

f = open("corpus.txt","r")
w, h = 100, 100;
k=1
a=0
uwordcount=[]
for i in range(100):
       uwordcount.append(0)
uword = [[0 for x in range(w)] for y in range(h)]
l = [[0 for x in range(w)] for y in range(h)] 
l[1] = f.readline()
l[2] = f.readline()
l[3] = f.readline()
l[4] = f.readline()
lwords = [[0 for x in range(w)] for y in range(h)] 
lwords[1]=l[1].split()
lwords[2]=l[2].split()
lwords[3]=l[3].split()
lwords[4]=l[4].split()
for i in [1,2,3,4]:
    for j in range(len(lwords[i])):
        uword[k]=lwords[i][j]
        uwordcount[k]=0
        for x in [1,2,3,4]:
            for y in range(len(lwords[i])):
                if uword[k] == lwords[x][y]:
                    uwordcount[k]=uwordcount[k]+1
        for z in range(k):
            if uword[k]==uword[z]:
                a=1

        if a==0:
            print(uwordcount[k],' ',uword[k])
            k=k+1

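I get this error:

Traceback (most recent call last):
  File "F:\New folder\1.py", line 25, in <module>
    if uword[k] == lwords[x][y]:
IndexError: list index out of range

Can anyone tell me what is wrong here?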


3 answers

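You have far too many lists. Also, don't rely on all those magic numbers for the number of lines, the maximum number of words or entries per list, and so on. Instead of one list for the words in each line, use a single loop over all the words. And instead of a second list for the counts, use a dictionary that holds the unique words and their counts: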

with open("corpus.txt") as f:
    counts = {}
    for line in f:
        for word in line.split():
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

Afterwards, counts looks like this: {'peter': 4, 'piper': 4, 'picked': 4, 'a': 3, 'peck': 4, 'of': 4, 'pickled': 4, 'peppers': 4, 'if': 1, 'where': 1, 's': 1, 'the': 1}. To retrieve the words and their counts, you can again use a loop:

^{pr2}$
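A minimal sketch of such a loop, assuming the counts dictionary built above. So that the snippet runs on its own, the dictionary is rebuilt here from an inline copy of the corpus; the name corpus_lines is made up for illustration and is not part of the original answer.

```python
# Inline copy of the four corpus lines, so no file is needed (illustrative only).
corpus_lines = [
    "peter piper picked a peck of pickled peppers",
    "a peck of pickled peppers peter piper picked",
    "if peter piper picked a peck of pickled peppers",
    "where s the peck of pickled peppers peter piper picked",
]

# Same counting logic as in the answer above.
counts = {}
for line in corpus_lines:
    for word in line.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

# Walk over the (word, count) pairs; print the count first, then the word.
for word, count in counts.items():
    print(count, word)
```

Dictionaries keep insertion order in Python 3.7+, so the words are printed in the order they first appear in the corpus.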

Of course, you could do the same in fewer lines of code with collections.Counter, but I think doing it by hand will help you learn more about Python.


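Honestly, I don't understand what any of the code below for i in [1,2,3,4]: is supposed to do. It looks as if you want to build some kind of co-occurrence matrix for the words? In that case I would also suggest a (nested) dictionary, which makes it much easier to store and retrieve entries.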

with open("corpus.txt") as f:
    matrix = {}
    for line in f:
        for word1 in line.split():
            if word1 not in matrix:
                matrix[word1] = {}
            for word2 in line.split():
                if word2 != word1:
                    if word2 not in matrix[word1]:
                        matrix[word1][word2] = 1
                    else:
                        matrix[word1][word2] += 1

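The code is almost the same as before, but with one more nested loop over the words of the same line. For example, the entry for "peter" would be {'piper': 4, 'picked': 4, 'a': 3, 'peck': 4, 'of': 4, 'pickled': 4, 'peppers': 4, 'if': 1, 'where': 1, 's': 1, 'the': 1}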
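The same nested-dictionary idea can also be sketched with collections.defaultdict and collections.Counter, which removes the explicit "not in" checks. This is only an illustration under the same assumptions as the code above; the function name cooccurrence and the inline corpus_lines list are made up for the example.

```python
from collections import Counter, defaultdict

def cooccurrence(lines):
    """For every word, count how often each other word shares a line with it."""
    matrix = defaultdict(Counter)  # unseen words start with an empty Counter
    for line in lines:
        words = line.split()
        for word1 in words:
            for word2 in words:
                if word2 != word1:
                    matrix[word1][word2] += 1  # missing entries start at 0
    return matrix

# Inline copy of the corpus so the sketch runs without a file.
corpus_lines = [
    "peter piper picked a peck of pickled peppers",
    "a peck of pickled peppers peter piper picked",
    "if peter piper picked a peck of pickled peppers",
    "where s the peck of pickled peppers peter piper picked",
]

matrix = cooccurrence(corpus_lines)
print(dict(matrix["peter"]))
print(matrix["if"]["where"])  # a Counter returns 0 for pairs never seen together
```

One small difference from the plain-dictionary version: a word that is alone on its line gets no entry at all here, whereas the version above would store an empty dictionary for it.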

Honestly, I could not follow your code, because it has more loops and logic than it needs (I think). So I did it my own way.

import pprint

with open('corpus.txt', 'r') as cr:
    dic = {}  # Maps each word to the number of times it has been seen

    for line in cr:
        for word in line.split():
            if word in dic:   # Key already exists in dic: add 1 to its value
                dic[word] += 1
            else:
                dic[word] = 1   # Key not present in dic yet: start its count at 1

pprint.pprint(dic)  # pprint (standard library) pretty-prints the dictionary

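If you are in a real hurry, use collections.Counter.

IndexError: list index out of range means that one of your indexes is trying to access something beyond the end of a list. You need to debug your code to find out where that happens.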
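In the question's code, one place where this happens is the innermost loop: y runs over range(len(lwords[i])) but is then used to index lwords[x], so as soon as line i has more words than line x the index runs past the end. The sketch below reproduces that with the corpus lines and shows one possible fix; the helper names count_buggy and count_fixed are made up for illustration.

```python
corpus_lines = [
    "peter piper picked a peck of pickled peppers",            # 8 words
    "a peck of pickled peppers peter piper picked",            # 8 words
    "if peter piper picked a peck of pickled peppers",         # 9 words
    "where s the peck of pickled peppers peter piper picked",  # 10 words
]
lwords = [line.split() for line in corpus_lines]

def count_buggy(target, i):
    """Mimics the question's inner loops: y is bounded by line i, not line x."""
    total = 0
    for x in range(len(lwords)):
        for y in range(len(lwords[i])):   # wrong bound: length of line i ...
            if target == lwords[x][y]:    # ... used to index line x
                total += 1
    return total

def count_fixed(target):
    """Bounds y by the length of the line that is actually indexed."""
    total = 0
    for x in range(len(lwords)):
        for y in range(len(lwords[x])):
            if target == lwords[x][y]:
                total += 1
    return total

try:
    count_buggy("if", 2)  # third line has 9 words, the first line only 8
except IndexError as err:
    print("IndexError:", err)

print(count_fixed("peter"))
```

Note also that the flag a in the question is never reset to 0, so once the first repeated word has been seen nothing else is printed, even after the index problem is fixed.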


Use collections.Counter to simplify this task:

# with open('corpus.txt', 'r') as r: text = r.read()

text = """peter piper picked a peck of pickled peppers 
 a peck of pickled peppers peter piper picked 
 if peter piper picked a peck of pickled peppers 
 where s the peck of pickled peppers peter piper picked """

from collections import Counter

# split the text in lines, then each line into words and count those:
c = Counter( (x for y in text.strip().split("\n") for x in y.split()) )

# format the output
print(*(f"{cnt} {wrd}" for wrd,cnt in c.most_common()), sep="\n") 

Output:

4 peter
4 piper
4 picked
4 peck
4 of
4 pickled
4 peppers
3 a
1 if
1 where
1 s
1 the
