Mongodb聚合透视嵌入式lis

2024-09-25 00:32:51 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个带有嵌入式列表的文档结构,在我使用聚合管道展开该结构之后,我最终得到以下内容

注意:我将Python与pymongo一起用于此查询

pipeline = [
    {'$unwind': '$saved_alloys'},
    {
        '$project': {
            '_id': 0,
            'name': '$saved_alloys.name',
            'compositions': '$saved_alloys.compositions'
        }
    }
]

res = db['alloys'].aggregate(pipeline)

for e in res:
  print(e)

输出(截断):

{
  'name': 'alloy-1', 
  'compositions': [
    {'symbol': 'C', 'weight': 0.36},
    {'symbol': 'Mn', 'weight': 1.41}
    {'symbol': 'Si', 'weight': 1.03},
    {'symbol': 'Ni', 'weight': 1.7}
  ]
}
{
  'name': 'alloy-2',
  'compositions': [
    {'symbol': 'C', 'weight': 0.21}, 
    {'symbol': 'Mn', 'weight': 0.23}, 
    {'symbol': 'Si', 'weight': 0.86},
    {'symbol': 'Ni', 'weight': 0.67},
    {'symbol': 'Cr', 'weight': 0.12},
  ]
}
...

我试图通过将合成物的维数降低到笛卡尔平面来对此进行一些数据分析。因此,我想通过执行pd.DataFrame(list(res))将其放入熊猫数据帧中

最后,我想给出一个具有以下结构的表:

{
  "name": "alloy-1",
  "C": 0.36,
  "Mn": 1.41,
  "Si": 1.03,
  "Ni": 1.7,
  "Cr": 0.0
},
{
  "name": "alloy-2",
  "C": 0.21,
  "Mn": 0.23,
  "Si": 0.86,
  "Ni": 0.67,
  "Cr": 0.12
}

注意compositions列表可以有可变的大小,因此对于那些不在列表中的元素,我想添加它们,但是weights值为零(如Cralloy-1

提前感谢您的帮助


Tags: name列表pipelineressymbol结构crweight
1条回答
网友
1楼 · 发布于 2024-09-25 00:32:51

毫无疑问,这是可以优化的,但作为一种简单的入门方法,为每个返回的输出构造一个熊猫系列,并附加到数据帧中;最后用0.0替换任何“缺失”值

from pymongo import MongoClient
import pandas as pd
import numpy as np

db = MongoClient()["mydatabase"]

db.alloys.insert_one({
    'saved_alloys': [{
        'name': 'alloy-1',
        'compositions': [
            {'symbol': 'C', 'weight': 0.36},
            {'symbol': 'Mn', 'weight': 1.41},
            {'symbol': 'Si', 'weight': 1.03},
            {'symbol': 'Ni', 'weight': 1.7}
        ]
    },
        {
            'name': 'alloy-2',
            'compositions': [
                {'symbol': 'C', 'weight': 0.21},
                {'symbol': 'Mn', 'weight': 0.23},
                {'symbol': 'Si', 'weight': 0.86},
                {'symbol': 'Ni', 'weight': 0.67},
                {'symbol': 'Cr', 'weight': 0.12},
            ]
        }]
}
)

pipeline = [
    {'$unwind': '$saved_alloys'},
    {
        '$project': {
            '_id': 0,
            'name': '$saved_alloys.name',
            'compositions': '$saved_alloys.compositions'
        }
    }
]

res = db['alloys'].aggregate(pipeline)
df = pd.DataFrame()

for alloy in res:
    ser = pd.Series()
    # Set the series name as the alloy
    ser.name = alloy['name']

    for composition in alloy['compositions']:
        # Add in each alloy to the series
        ser.at[composition['symbol']] = composition['weight']

    df = df.append(ser)

# Once we have our DataFrame, replace any missing values with 0.0
df = df.replace(np.nan, 0.0)
print(df)

结果:

            C    Mn    Ni    Si    Cr
alloy-1  0.36  1.41  1.70  1.03  0.00
alloy-2  0.21  0.23  0.67  0.86  0.12

相关问题 更多 >