<p>The problem is that you are calling <code>findall</code> on the whole file instead of on each message's text. Do the following instead:</p>
<pre><code>import re
import collections
import json
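def words(s):
    # \w+ matches runs of Unicode word characters (the default for str patterns in Python 3)
    return re.findall(r'\w+', s)

# Open the file in a with block so it is closed automatically
with open('message.json', encoding="utf8") as file:
    data = json.load(file)

# Count lowercased words across the 'content' field of every message
counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))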
most_common = counts.most_common(50)
print(most_common)
</code></pre>
<p><strong>Output</strong></p>
<pre><code>[('siä', 1), ('ci', 1), ('podobajä', 1)]
</code></pre>
<p>This output is for a file with the following content (a list of JSON objects):</p>
<pre><code>[{
  "sender_name": "xxxxxx",
  "timestamp_ms": 1540327935616,
  "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
  "type": "Generic"
}]
</code></pre>
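<p><strong>Explanation</strong></p>
<p><code>json.load</code> loads the contents of the file into <code>data</code>, a list of dictionaries. The code then iterates over those dictionaries and uses the <code>words</code> function together with <code>Counter</code> to count the words in each <code>'content'</code> field. Messages without a <code>'content'</code> key are treated as empty strings.</p>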
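<p>The same steps can be sketched without a file on disk. This is only an illustration: <code>SAMPLE</code> and <code>count_words</code> are hypothetical names that are not part of the original code, and the inline string stands in for the contents of <code>message.json</code>.</p>

```python
import collections
import json
import re

# Hypothetical inline sample standing in for the contents of message.json.
# The second message has no 'content' key on purpose.
SAMPLE = '[{"sender_name": "xxxxxx", "content": "aaa bbb AAA"}, {"sender_name": "yyy"}]'

def words(s):
    # Split a string into runs of word characters
    return re.findall(r'\w+', s)

def count_words(messages):
    # Count lowercased words over the 'content' field of each message;
    # a missing 'content' key contributes no words
    return collections.Counter(
        w.lower() for e in messages for w in words(e.get('content', ''))
    )

counts = count_words(json.loads(SAMPLE))
print(counts.most_common(2))  # [('aaa', 2), ('bbb', 1)]
```

<p>Because the regular expression runs on each message's text rather than on the raw file, JSON keys such as <code>sender_name</code> never end up in the counts.</p>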
<p><strong>Further</strong></p>
<ol>
<li>To remove words such as I, a, and but (stop words), see <a href="https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python">this question</a></li>
</ol>
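<p>As a rough sketch of stop-word removal, the words can be filtered against a set before counting. The set below is a hypothetical, deliberately tiny list used only for illustration, and <code>count_without_stop_words</code> is an assumed helper name; a real stop-word list, such as the NLTK one discussed in the linked question, is much larger.</p>

```python
import collections
import re

# Hypothetical, deliberately tiny stop-word set for illustration only
STOP_WORDS = {'i', 'a', 'but'}

def count_without_stop_words(texts, stop_words=STOP_WORDS):
    # Count lowercased words in the given strings, skipping stop words
    counts = collections.Counter()
    for text in texts:
        for w in re.findall(r'\w+', text):
            w = w.lower()
            if w not in stop_words:
                counts[w] += 1
    return counts

counts = count_without_stop_words(['I like tea but not a lot', 'Tea is fine'])
print(counts.most_common(1))  # [('tea', 2)]
```

<p>Lowercasing before the membership test means the set only needs lowercase entries.</p>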
<p><strong>Update</strong></p>
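<p>Given the format of your file, you need to change the line <code>data = json.load(file)</code> to <code>data = json.load(file)["messages"]</code>, because the list of messages is nested under the <code>"messages"</code> key. For the following content:</p>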
<pre><code>{
  "participants": [],
  "messages": [
    {
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329382942,
      "content": "aaa",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329262248,
      "content": "aaa",
      "type": "Generic"
    }
  ]
}
</code></pre>
<p>The output is:</p>
<pre><code>[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
</code></pre>
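<p>If both file layouts may occur, one possible approach is to normalise the parsed JSON before counting. This is a minimal sketch, not part of the original answer: <code>extract_messages</code> is a hypothetical helper, and it assumes the file is either a bare list of messages or an object with a <code>"messages"</code> list.</p>

```python
import json

def extract_messages(data):
    # Layout from the update: an object holding the list under "messages"
    if isinstance(data, dict):
        return data.get('messages', [])
    # Original layout: the parsed JSON is already a list of message objects
    return data

wrapped = json.loads('{"participants": [], "messages": [{"content": "aaa"}]}')
flat = json.loads('[{"content": "aaa"}]')
print(extract_messages(wrapped) == extract_messages(flat))  # True
```

<p>The result can then be passed to the counting code in place of <code>data</code>.</p>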