如何从elasticsearch中捕获特定字段并转换为datafram

2024-09-27 00:13:02 发布

您现在位置:Python中文网/ 问答频道 /正文

使用queryhttp://abc:9200/abc/_search?q=aid:123我可以得到以下结果。我想筛选的第一件事是“aid”,然后只捕获所需的字段(“act\u timestamp”、“act\u type”、“mod\u path”、“mod\u size”)以及“procguid”,并将它们放在一个表中。你知道吗

提供以下准确的输入和输出:

输入:

{
"took": 95,
"timed_out": false,
"_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
},
"hits": {
    "total": 4,
    "max_score": 1,
    "hits": [
        {
            "_index": "xxx",
            "_type": "_doc",
            "_id": "xxxxxxxxx",
            "_score": 1,
            "_source": {
                "aid": 123,
                "activity": [
                    {
                        "act_timestamp": 1549437711762,
                        "act_type": "reg_act",
                        "reg_action": "writeval",
                        "reg_action_typeid": 3
                    },
                    {
                        "act_timestamp": 1549437714812,
                        "act_type": "reg_act",
                        "reg_action": "writeval",
                        "reg_action_typeid": 3
                    },
                    {
                        "act_timestamp": 1549437711418,
                        "act_type": "child_proc",
                        "child_name": "audiodg.exe"
                    }
                ],
                "event_timestamp": 1549437712183,
                "tid": "demo",
                "procguid": "11111",
                "doc_id": 1000881655327721
            }
        },
        {
            "_index": "xxx",
            "_type": "_doc",
            "_id": "xxxxxxxxx",
            "_score": 1,
            "_source": {
                "aid": 123,
                "activity": [
                    {
                        "act_timestamp": 1549437977317,
                        "act_type": "mod_load",
                        "mod_path": "path1",
                        "mod_size": 32768
                    },
                    {
                        "act_timestamp": 1549437977390,
                        "act_type": "mod_load",
                        "mod_path": "path2",
                        "mod_size": 667648
                    },
                    {
                        "act_timestamp": 1549437977399,
                        "act_type": "reg_act",
                        "reg_action": "writeval",
                        "reg_action_typeid": 3
                    },
                    {
                        "act_timestamp": 1549437977407,
                        "act_type": "child_proc",
                        "child_name": "conhost.exe"
                    }
                ],
                "event_timestamp": 1549437977406,
                "tid": "demo",
                "procguid": "22222",
                "doc_id": 1010298962848944
            }
        },
        {
            "_index": "xxx",
            "_type": "_doc",
            "_id": "xxxxxxxxx",
            "_score": 1,
            "_source": {
                "aid": 123,
                "activity": [
                    {
                        "act_timestamp": 1549437772915,
                        "act_type": "mod_load",
                        "mod_path": "path3",
                        "mod_size": 409600
                    },
                    {
                        "act_timestamp": 1549437772940,
                        "act_type": "mod_load",
                        "mod_path": "path4",
                        "mod_size": 937984
                    },
                    {
                        "act_timestamp": 1549437953405,
                        "act_type": "child_proc",
                        "child_name": "conhost.exe"
                    }
                ],
                "event_timestamp": 1549437953405,
                "tid": "demo",
                "procguid": "22222",
                "doc_id": 1007612603810098
            }
        },
        {
            "_index": "xxx",
            "_type": "_doc",
            "_id": "xxxxxxxxx",
            "_score": 1,
            "_source": {
                "aid": 123,
                "activity": [
                    {
                        "act_timestamp": 1549437848842,
                        "act_type": "mod_load",
                        "mod_path": "path5",
                        "mod_size": 1679360
                    },
                    {
                        "act_timestamp": 1549437848844,
                        "act_type": "mod_load",
                        "mod_path": "path6",
                        "mod_size": 2121728
                    },
                    {
                        "act_timestamp": 1549437848864,
                        "act_type": "mod_load",
                        "mod_path": "path7",
                        "mod_size": 266240
                    },
                    {
                        "act_timestamp": 1549437849590,
                        "act_type": "reg_act",
                        "reg_action": "writeval",
                        "reg_action_typeid": 3
                    },
                    {
                        "act_timestamp": 1549437953418,
                        "act_type": "child_proc",
                        "child_name": "wpscloudsvr.exe"
                    }
                ],
                "event_timestamp": 1549437953417,
                "tid": "demo",
                "procguid": "33333",
                "doc_id": 1007725853753652
            }
        }]}}

我希望输出如下:

procguid   act_type   mod_path      mod_size

22222     mod_load     path1        32768
22222     mod_load     path2        667648
22222     mod_load     path3        409600
22222     mod_load     path4        937984
33333     mod_load     path5        1679360
33333     mod_load     path6        2121728
33333     mod_load     path7        266240

那么,有没有办法把这些海量的elasticsearch文档转换成这样的数据帧呢?你知道吗

有什么建议/解决方案代码吗?由于我不熟悉elasticsearch/json,我已经被困了很长时间。你知道吗

编辑:有些块在具有“procguid”时也缺少“activity”字段


Tags: pathidmodchildsizedoctypeload
1条回答
网友
1楼 · 发布于 2024-09-27 00:13:02

你可以把你的任务分解成几个部分。一开始你专注于

df = pd.io.json.json_normalize(d['hits']['hits'])

这有列

['_id', '_index', '_score', '_source.activity', '_source.aid', '_source.doc_id', '_source.event_timestamp', '_source.procguid', '_source.tid', '_type']

现在可以遍历列_source.activity,将每个单元格转换为数据帧并收集它们:

dfs = []
for idx, cell in df['_source.activity'].iteritems():
    df_activity = pd.io.json.json_normalize(cell)
    df_activity['procguid'] = df.loc[idx, '_source.procguid']
    dfs.append(df_activity)

把所有的东西放在一起(希望如此)就能得到预期的结果:

pd.concat(dfs, sort=False, ignore_index=True)[['procguid', 'act_type', 'mod_path', 'mod_size']]

相关问题 更多 >

    热门问题