Python collections.Counter: excluding things in the JSON from the count

import re
import collections
import json

file = open('message.json', encoding="utf8")
a = file.read()
words = re.findall(r'\w+', a)
most_common = collections.Counter(map(str.lower, words)).most_common(50)
print(most_common)

2 Answers

User

#1 · Edited 2024-06-26 14:13:50

Have you tried reading the JSON as a dictionary and checking the types? You can also look for unwanted words afterwards and remove them.

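import json
from collections import Counter

# Words to exclude from the result (placeholder; fill in your own).
UNWANTED_WORDS = set()

def get_words(string):
    # Split on whitespace and normalize to lowercase.
    return [word.lower() for word in string.split()]

def count_words(json_item):
    # Recursively collect the words from every string in the parsed JSON.
    if isinstance(json_item, dict):
        words = []
        for key, value in json_item.items():
            # Accumulate over all items instead of returning after the first one.
            words += count_words(key) + count_words(value)
        return words
    elif isinstance(json_item, str):
        return get_words(json_item)
    elif isinstance(json_item, list):
        return [word for item in json_item for word in count_words(item)]
    else:
        # Numbers, booleans and null contribute no words.
        return []

with open('message.json', encoding="utf-8") as f:
    json_input = json.load(f)
counter = Counter(count_words(json_input))
result = {key: value for key, value in counter.items() if key not in UNWANTED_WORDS}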

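User

#2 · Edited 2024-06-26 14:13:50

The problem is that you are running findall over the entire file, so JSON keys and metadata are counted as well. Do the following instead: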

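import re
import collections
import json


def words(s):
    # Raw string so that \w reaches the regex engine unchanged.
    return re.findall(r'\w+', s, re.UNICODE | re.IGNORECASE)

# Parse the JSON rather than scanning the raw file text.
with open('message.json', encoding="utf8") as file:
    data = json.load(file)

# Count only the words in each message's 'content' field.
counts = collections.Counter(w.lower() for e in data for w in words(e.get('content', '')))
most_common = counts.most_common(50)
print(most_common)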

Output

[('siä', 1), ('ci', 1), ('podobajä', 1)]

This output is for a file with the following content (a list of JSON objects):

[{
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
}]

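Explanation

json.load loads the contents of the file as data, a list of dictionaries. The code then iterates over those dictionaries and uses the words function and Counter to count the words in each 'content' field.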
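As a minimal sketch of the steps just described, the same pipeline can be run on an in-memory JSON string instead of message.json. The sample data and the name raw are illustrative assumptions, not part of the original file.

```python
import collections
import json
import re

# Illustrative sample data (an assumption), shaped like a list of message objects.
raw = (
    '[{"sender_name": "xxxxxx", "content": "aaa bbb AAA"},'
    ' {"sender_name": "aaa", "type": "Generic"}]'
)


def words(s):
    # Split a string into runs of word characters.
    return re.findall(r'\w+', s)


data = json.loads(raw)

# Only the 'content' field is inspected; entries without it contribute nothing.
counts = collections.Counter(
    w.lower() for e in data for w in words(e.get('content', ''))
)
print(counts.most_common(2))  # [('aaa', 2), ('bbb', 1)]
```

Because only e.get('content', '') is read, keys such as sender_name and their values never reach the counter.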

Further

To remove words such as "I", "a" and "but", see this
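One simple way to drop such words, sketched here under the assumption that a small hand-written stop-word set is enough. STOP_WORDS and remove_stop_words are illustrative names, not something taken from the linked resource.

```python
from collections import Counter

# Hypothetical stop-word list (an assumption); extend it to suit your data.
STOP_WORDS = {"i", "a", "but"}


def remove_stop_words(counts, stop_words=STOP_WORDS):
    """Return a new Counter without the given (lowercase) stop words."""
    return Counter({w: n for w, n in counts.items() if w not in stop_words})


counts = Counter(["i", "like", "a", "cat", "but", "i", "like", "dogs"])
filtered = remove_stop_words(counts)
print(filtered.most_common(1))  # [('like', 2)]
```

Filtering after counting keeps the counting code unchanged; the words are assumed to be lowercased already, as they are in the answer's Counter.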

Update

Given the format of your file, you need to change the line data = json.load(file) to data = json.load(file)["messages"], for content like the following:

{
  "participants":[],
  "messages": [
    {
      "sender_name": "xxxxxx",
      "timestamp_ms": 1540327935616,
      "content": "Podobaj\u00c4\u0085 ci si\u00c4\u0099",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329382942,
      "content": "aaa",
      "type": "Generic"
    },
    {
      "sender_name": "aaa",
      "timestamp_ms": 1540329262248,
      "content": "aaa",
      "type": "Generic"
    }
  ]
}

The output is:

[('aaa', 2), ('siä', 1), ('podobajä', 1), ('ci', 1)]
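If it is not known in advance which of the two layouts a file uses (a bare list of messages, or an object with a "messages" key), a small helper can accept both. This is only a sketch; load_messages is a hypothetical name, and it works on a JSON string so that it can be tried without a file.

```python
import json


def load_messages(text):
    """Return the list of message dicts from either file layout."""
    data = json.loads(text)
    if isinstance(data, dict):
        # Top-level object: the messages live under the "messages" key.
        return data.get("messages", [])
    # Otherwise the top level is assumed to be the list of messages itself.
    return data


bare = '[{"content": "aaa"}]'
wrapped = '{"participants": [], "messages": [{"content": "aaa"}]}'
print(load_messages(bare) == load_messages(wrapped))  # True
```

The returned list can then be passed to the Counter expression in place of data.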
