How do I update a dictionary using an if statement?

Posted 2024-10-01 04:50:55

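I want to create a histogram: a dictionary that shows, for each word length, how many words of that length occur in the input text. So far I have managed to create a dictionary containing all the word lengths that occur, but I can't seem to update it. I'm stuck on an error. The full Python traceback: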

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-4ae1bb3ffd5e> in <module>
----> 1 text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")

<timed exec> in text2wordlengthPDF(text)

TypeError: cannot unpack non-iterable int object

My code looks like this:

def text2wordlengthPDF(text):
    '''Read in the text document `text`, tokenize it using re.split and regex \W+, and create 
    the histogram of wordlenghts using the Counter method. Return this histogram. 
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''

    #.read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r").read())
    tokens_counter = Counter(tokens)

    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for k,v in dict_histogram.items():
        if (k == len(w) for w in tokens_counter):
            k[v] = +1
    dict_histogram 

    print(dict_histogram)

# run and plot    
#pdf= text2wordlengthPDF(linktopdf())
#pdfS= pd.Series(pdf).sort_index()

#pdfS[pdfS>5].plot(kind='bar' ) #plot only the wordlenghts occurring more then 5 times.
#print(pdf)

#This is where I run my code with the input text
text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt") 



1 answer
User
#1 · Posted 2024-10-01 04:50:55

This part

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        k[v] = +1

makes no sense: k, the key (the length of each word), is not a dictionary. (Also, you probably meant k[v] += 1.)
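To make both points concrete, here is a small sketch. The names `counts` and `length` are made up for illustration and do not come from the question. It shows that `= +1` assigns the value positive one instead of incrementing, and that an integer key cannot be indexed the way a dictionary can.

```python
counts = {3: 0}

# `= +1` is plain assignment of the value +1; repeating it never goes past 1.
counts[3] = +1
counts[3] = +1
assigned = counts[3]

# `+= 1` increments the stored value each time it runs.
counts[3] += 1
incremented = counts[3]

# An int is not a container, so item assignment on it raises TypeError.
length = 3
try:
    length[0] = 1
    error = None
except TypeError as exc:
    error = type(exc).__name__

print(assigned, incremented, error)
```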

Rewriting it more correctly would lead to

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        dict_histogram[k] += v

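But that does not work either. (In fact, I am surprised the if line is not an outright syntax error. Is it valid syntax?) The value v still holds its original value, which is 0 from the initialization. You would want len(w) there, but you can't use it, because w is only a local variable inside the expression on the previous line.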
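On the aside about syntax: the if line is valid Python. The parenthesised expression is a generator expression, and a generator object is truthy no matter what it would yield, so the branch runs on every iteration. A minimal sketch with invented sample words shows this, using `any()` as one way to actually evaluate the comparisons:

```python
words = ["big", "data"]

# A generator expression is an object, and objects are truthy by default,
# even when every comparison it would produce is False.
gen = (len(w) == 99 for w in words)
always_true = bool(gen)

# any() consumes the generator and tests the individual comparisons.
has_len_99 = any(len(w) == 99 for w in words)
has_len_3 = any(len(w) == 3 for w in words)

print(always_true, has_len_99, has_len_3)
```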

Rewriting that whole part leads me to the following:

import pprint
import re
from collections import Counter

def text2wordlengthPDF(text):
    r'''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of word lengths using the Counter method. Return this histogram.
    The histogram is a dict showing for each word length how many words with that length are in the input text.'''

    # .read() returns the file contents as one string; `with` closes the file afterwards
    with open(text, "r", encoding="utf8") as f:
        tokens = re.split(r'\W+', f.read())
    tokens_counter = Counter(tokens)

    # create list of the distinct word lengths found in the Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i: 0 for i in wordlength}
    # add each word's number of occurrences to the entry for its length
    for key, occurrence in tokens_counter.items():
        dict_histogram[len(key)] += occurrence

    pprint.pprint(sorted(dict_histogram.items()), compact=True)
    return dict_histogram

text2wordlengthPDF("pslrm.txt")

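Each word is taken from the counter as key, so its value tokens_counter[key] is the number of times it occurs. The counter's items() method lets you iterate over both at once.
That number is then added to the dictionary, which is indexed by word length. The final sorted lists the counts in ascending order of length: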

[(0, 2), (1, 57262), (2, 54080), (3, 95251), (4, 132448), (5, 29969),
 (6, 62938), (7, 46593), (8, 23929), (9, 14645), (10, 12943), (11, 10708),
 (12, 2940), (13, 2742), (14, 1807), (15, 827), (16, 312), (17, 17965),
 (18, 91), (19, 118), (20, 147), (21, 24), (22, 35), (23, 7), (24, 13), (25, 1),
 (26, 24), (28, 1), (29, 24), (34, 1)]

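(In my test corpus, the PostScript Language Reference Manual, the 34-character "word" happens to be a random hexadecimal string: 4c47494b4d4c524c4d50535051554c5152. The other overly long words are equally uninteresting, alas.)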
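As a possible simplification, not part of the answer itself: `Counter` accepts any iterable, so the lengths can be counted directly without first building a dictionary of zeros. The sketch below uses a hypothetical helper name, `word_length_histogram`, and takes a string rather than a file name. It also skips the empty strings that `re.split` yields when the text starts or ends with a non-word character, which is likely where the `(0, 2)` entry in the output above comes from.

```python
import re
from collections import Counter

def word_length_histogram(text):
    """Map each word length to the number of words of that length in `text`."""
    tokens = re.split(r"\W+", text)
    # Skip empty tokens produced by leading or trailing non-word characters.
    return Counter(len(token) for token in tokens if token)

histogram = word_length_histogram("Big data, free and safe society.")
print(sorted(histogram.items()))
```

Because a `Counter` is a `dict` subclass, the result can still be passed to `pd.Series` as in the commented-out plotting lines of the question.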
