How do I update a dictionary using an if statement?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-4ae1bb3ffd5e> in <module>
----> 1 text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")

<timed exec> in text2wordlengthPDF(text)

TypeError: cannot unpack non-iterable int object

def text2wordlengthPDF(text):
    '''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of wordlenghts using the Counter method. Return this histogram.
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''

    #.read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r").read())
    tokens_counter = Counter(tokens)

    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for k,v in dict_histogram.items():
        if (k == len(w) for w in tokens_counter):
            k[v] = +1
    dict_histogram
    print(dict_histogram)

# run and plot
#pdf= text2wordlengthPDF(linktopdf())
#pdfS= pd.Series(pdf).sort_index()
#pdfS[pdfS>5].plot(kind='bar' ) #plot only the wordlenghts occurring more then 5 times.
#print(pdf)

#This is where I run my code with the input text
text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")

1 answer

Answer 1 · posted 2024-10-01 04:50:55

This part

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        k[v] = +1

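makes no sense: k, the key (a word length), is an integer, not a dictionary, so it cannot be indexed. (Also, you probably meant k[v] += 1; as written, = +1 assigns positive one rather than incrementing.)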
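To see concretely why that line fails, here is a minimal sketch. The dictionary `hist` is a made-up stand-in for `dict_histogram`, and the variable names are only for illustration.

```python
# Hypothetical stand-in for dict_histogram: word length -> count
hist = {3: 0, 5: 0}

errors = []
for k, v in hist.items():
    try:
        k[v] = +1  # k is an int key, so item assignment is not possible
    except TypeError as exc:
        errors.append(str(exc))

print(errors[0])

# `= +1` assigns positive one, whereas `+= 1` increments.
count = 5
count = +1
assigned = count      # 1

count = 5
count += 1
incremented = count   # 6
```

Every pass through the loop raises a TypeError, because the assignment targets the integer key rather than the dictionary.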

Rewriting it somewhat more correctly would lead to

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        dict_histogram[k] += v

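but that does not work either. (I am surprised the if line is not an outright syntax error. It is valid syntax: the parentheses make a generator expression, and a generator object is always truthy, so the condition never filters anything.) More importantly, the value v still holds what is stored under the key, which is 0 (from the initialisation). You want len(w) there, but you cannot use it, because w is only a local variable of the expression on the previous line.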
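To check what that `if` line actually tests, here is a small sketch. The word list and the value of `k` are made-up sample data.

```python
words = ["a", "bb", "ccc"]  # made-up sample tokens
k = 99                      # no word in the list has this length

# A parenthesised generator expression is an object, and it is truthy
# no matter what values it would produce.
gen = (k == len(w) for w in words)
always_true = bool(gen)

# any() consumes the generator and is what a real membership test needs.
real_test = any(k == len(w) for w in words)

print(always_true, real_test)
```

So the `if` body runs for every key, whether or not a word of that length exists.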

Rewriting that entire part brings me to the following:

import pprint
import re
from collections import Counter

def text2wordlengthPDF(text):
    r'''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of wordlenghts using the Counter method. Return this histogram. 
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''

    #.read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r", encoding="utf8").read())
    tokens_counter = Counter(tokens)

    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for key,occurrence in tokens_counter.items():
        dict_histogram[len(key)] += occurrence

    pprint.pprint(sorted(dict_histogram.items()), compact=True)
    # Return the histogram, as the docstring promises
    return dict_histogram

text2wordlengthPDF("pslrm.txt")

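Each key taken from the counter is a word, and its value tokens_counter[key] is the number of times that word occurs. The counter's items() method lets you iterate over both at once.
That count is then added to the dictionary entry indexed by the word's length. The final sorted lists the counts per length in ascending order of length: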

[(0, 2), (1, 57262), (2, 54080), (3, 95251), (4, 132448), (5, 29969),
 (6, 62938), (7, 46593), (8, 23929), (9, 14645), (10, 12943), (11, 10708),
 (12, 2940), (13, 2742), (14, 1807), (15, 827), (16, 312), (17, 17965),
 (18, 91), (19, 118), (20, 147), (21, 24), (22, 35), (23, 7), (24, 13), (25, 1),
 (26, 24), (28, 1), (29, 24), (34, 1)]

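(In my test corpus, the PostScript Language Reference Manual, the 34-character "word" happens to be a random hex string: 4c47494b4d4c524c4d50535051554c5152. The other overlong words are equally uninteresting, alas.)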
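A shorter variant is possible, sketched here under some assumptions: it takes a string instead of a file name, the helper name `wordlength_histogram` is invented for this example, and it drops empty tokens. The (0, 2) entry in the output above presumably comes from the empty strings that re.split produces when the text starts or ends with a non-word character.

```python
import re
from collections import Counter

def wordlength_histogram(text_string):
    """Hypothetical helper: map each word length to the number of words of that length."""
    tokens = re.split(r'\W+', text_string)
    # `if w` skips the empty strings re.split yields at the edges of the text
    return Counter(len(w) for w in tokens if w)

hist = wordlength_histogram("The quick brown fox jumps over the lazy dog.")
print(sorted(hist.items()))
```

Counting the lengths directly gives the same totals as summing per-word occurrences, since each token contributes one to the count for its length.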
