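I want to create a histogram: a dictionary that shows, for each word length, how many words of that length occur in the input text. So far I have managed to create a dictionary containing all the possible word lengths, but I cannot seem to update it. I am stuck on an error.
The full Python traceback: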
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-4ae1bb3ffd5e> in <module>
----> 1 text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")
<timed exec> in text2wordlengthPDF(text)
TypeError: cannot unpack non-iterable int object
My code looks like this:
import re
from collections import Counter

def text2wordlengthPDF(text):
    '''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of wordlengths using the Counter method. Return this histogram.
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''
    # .read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r").read())
    tokens_counter = Counter(tokens)
    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter]))
    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for k,v in dict_histogram.items():
        if (k == len(w) for w in tokens_counter):
            k[v] = +1
    dict_histogram
    print(dict_histogram)

# run and plot
#pdf= text2wordlengthPDF(linktopdf())
#pdfS= pd.Series(pdf).sort_index()
#pdfS[pdfS>5].plot(kind='bar' ) #plot only the wordlengths occurring more than 5 times.
#print(pdf)

# This is where I run my code with the input text
text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")
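This part, the final `for k, v in dict_histogram.items():` loop, makes no sense. `k`, the key (the length of each word), is not a dictionary, so it cannot be indexed. (Also, you probably meant `k[v] += 1`.) Rewriting the assignment so that it updates `dict_histogram` would be more correct, but that still does not work. (In fact, I am surprised the `if` line is not an outright syntax error. Is it valid syntax?) The value `v` still holds the value originally stored under the key, which is `0` from the initialization. You want `len(w)` in there, but you cannot use it, because it is only a local variable inside the expression on the previous line.

Rewriting that entire part led me to the following approach.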
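A minimal sketch of such a rewrite, assuming the tokens come from `re.split(r'\W+', ...)` as in the question. The helper name `word_length_histogram` and the sample sentence are hypothetical and only keep the example self-contained. Skipping the empty strings that `re.split` can produce at the ends of a text is my own choice, not something the question asks for.

```python
import re
from collections import Counter


def word_length_histogram(text):
    """Map each word length to the number of words of that length in `text`."""
    tokens_counter = Counter(re.split(r"\W+", text))
    # Start every observed word length at zero, as the question does.
    dict_histogram = {len(w): 0 for w in tokens_counter if w}
    # key is a word, value is how often that word occurs.
    for key, value in tokens_counter.items():
        if key:  # assumption: ignore empty tokens left over by re.split
            dict_histogram[len(key)] += value
    return dict_histogram


sample = "the cat sat on the mat"  # hypothetical input text
histogram = word_length_histogram(sample)
print(sorted(histogram.items()))  # [(2, 1), (3, 5)]
```

To use it on a file, read the contents first and pass the resulting string to the function.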
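You get the `key` from the Counter as a word, so its value `tokens_counter[key]` is the number of occurrences. The Counter's `items()` method lets you iterate over both at once. Then add that number to the dictionary, which is indexed by the length of each word. The final `sorted` lists the counts per length in ascending order of length.

(In my test corpus, the PostScript Language Reference Manual, the 34-character "word" happens to be a random hexadecimal string: `4c47494b4d4c524c4d50535051554c5152`. The other overlong words are similarly uninteresting, alas.)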
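As for whether the question's `if` line is valid syntax: as far as I can tell, it is. The parentheses create a generator expression, and a generator object is always truthy, so the condition holds no matter what the words are. A small sketch with hypothetical sample data, comparing it with `any()`, which is probably closer to what was intended:

```python
words = ["the", "cat", "on"]  # hypothetical sample tokens
k = 7                         # no word in the list has this length

# A parenthesized generator expression is an object, not a boolean result.
condition = (k == len(w) for w in words)
print(bool(condition))  # True, even though no word has length 7

# any() consumes the generator and checks each comparison.
print(any(k == len(w) for w in words))  # False
print(any(3 == len(w) for w in words))  # True
```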