Converting a nested DataFrame with sorted unique values to a nested dictionary in Python

Published 2024-10-01 09:31:30

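I am trying to take a nested DataFrame and convert it into a nested dictionary.

This is my original DataFrame, with the following unique values:

Input: df.head(5)

Output: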

    reviewerName                                  title    reviewerRatings
0        Charles       Harry Potter Book Seven News:...                3.0
1      Katherine       Harry Potter Boxed Set, Books...                5.0
2           Lora       Harry Potter and the Sorcerer...                5.0
3           Cait       Harry Potter and the Half-Blo...                5.0
4          Diane       Harry Potter and the Order of...                5.0

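Input: len(df['reviewerName'].unique())

Output: 66130

Each of the 66130 unique values occurs in several rows (for example, "Charles" appears 3 times). I therefore used the 66130 unique reviewerName values as the keys of a new nested DataFrame, with title and reviewerRatings as a second layer that provides the key: value pairs within the same nested DataFrame.

Input: df = df.set_index(['reviewerName', 'title']).sort_index()

Output: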

                                                       reviewerRatings
    reviewerName                               title
         Charles    Harry Potter Book Seven News:...               3.0
                    Harry Potter and the Half-Blo...               3.5
                    Harry Potter and the Order of...               4.0
       Katherine    Harry Potter Boxed Set, Books...               5.0
                    Harry Potter and the Half-Blo...               2.5
                    Harry Potter and the Order of...               5.0
...
230898 rows x 1 columns

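As a follow-up to my first question, I am now trying to convert the nested DataFrame into a nested dictionary.

In the new nested DataFrame above, the header shows reviewerRatings on the first row (column 3) and reviewerName and title on the second row (columns 1 and 2). When I run df.to_dict() as below, the output has the form {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}.

Input: df.to_dict()

Output: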

{'reviewerRatings': 
 {
  ('Charles', 'Harry Potter Book Seven News:...'): 3.0, 
  ('Charles', 'Harry Potter and the Half-Blo...'): 3.5, 
  ('Charles', 'Harry Potter and the Order of...'): 4.0,   
  ('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0, 
  ('Katherine', 'Harry Potter and the Half-Blo...'): 2.5, 
  ('Katherine', 'Harry Potter and the Order of...'): 5.0,
 ...}
}

The output I want, shown below, has the form {reviewerName: {title: reviewerRating}}, which matches the way the nested DataFrame is sorted.

{'Charles': 
 {'Harry Potter Book Seven News:...': 3.0, 
  'Harry Potter and the Half-Blo...': 3.5, 
  'Harry Potter and the Order of...': 4.0},   
 'Katherine':
 {'Harry Potter Boxed Set, Books...': 5.0, 
  'Harry Potter and the Half-Blo...': 2.5, 
  'Harry Potter and the Order of...': 5.0},
...}

Is there a way to manipulate either the nested DataFrame or the nested dictionary so that df.to_dict() returns {reviewerName: {title: reviewerRating}}?

Thanks!

2 answers

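There are a few ways to do this. You can use groupby together with to_dict, or you can iterate over the rows with a collections.defaultdict. Note that the latter is not necessarily less efficient.

groupby + to_dict

Build a Series from each group of the groupby object and convert it to a dictionary, which gives a Series of dictionary values. Then convert that Series to a dictionary with a second to_dict call.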

res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()
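The snippet above assumes reviewerName and title are ordinary columns. If the frame has already been indexed with set_index(['reviewerName', 'title']) as in the question, a sketch along the following lines might work directly on the MultiIndex instead. The short titles and the variable names ratings and nested are made up for illustration.

```python
import pandas as pd

# Small stand-in for the question's indexed frame (titles are hypothetical)
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
}).set_index(['reviewerName', 'title']).sort_index()

ratings = df['reviewerRatings']

# Group on the outer index level, drop it inside each group,
# and let the remaining title index become the inner keys
nested = {
    name: group.droplevel('reviewerName').to_dict()
    for name, group in ratings.groupby(level='reviewerName')
}
print(nested)
```

Grouping by the index level avoids a reset_index() round trip, at the cost of a Python-level loop over the groups.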

collections.defaultdict

Define a defaultdict of dict objects and iterate over the DataFrame row by row.

from collections import defaultdict

res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings

The resulting defaultdict does not need to be converted back to a regular dict, because defaultdict is a subclass of dict.
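One practical difference may still matter: reading a missing key from a defaultdict creates an entry rather than raising KeyError. The sketch below, using a made-up sample entry, shows both the subclass relationship and the optional dict(res) conversion for cases where ordinary lookup behaviour is preferred.

```python
from collections import defaultdict

res = defaultdict(dict)
res['Charles']['Book A'] = 3.0  # hypothetical sample entry

# A defaultdict passes isinstance checks for dict
is_dict = isinstance(res, dict)

# Reading a missing key silently creates an empty inner dict
_ = res['Nobody']
created = 'Nobody' in res

# dict(res) is a shallow copy with ordinary KeyError behaviour
plain = dict(res)
del plain['Nobody']
try:
    plain['Someone']
    raised = False
except KeyError:
    raised = True

print(is_dict, created, raised)
```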

Performance benchmarking

Benchmarks depend on both the setup and the data. You should test with your own data to see what works best.

# Python 3.6.5, Pandas 0.19.2

from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)

df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # select the columns with a list; tuple-style selection fails on recent pandas
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

%timeit jez(df)   # 33.5 ms per loop
%timeit jpp1(df)  # 17 ms per loop
%timeit jpp2(df)  # 21.1 ms per loop
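Timings are only comparable if the candidates return the same mapping, so a quick equivalence check on a tiny frame may be worth running first. This is a sketch: the three-row frame is made up, and jez here selects columns with a list (an assumption made so that it runs on recent pandas versions).

```python
from collections import defaultdict

import pandas as pd

# Tiny hypothetical frame with the same column names as the benchmark
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book_1', 'Book_2', 'Book_1'],
    'reviewerRatings': [3, 4, 5],
})

def jez(df):
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# All three should agree before their timings are compared
same = jez(df) == jpp1(df) == dict(jpp2(df))
print(same)
```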

Use groupby with a lambda function to build a dictionary for each reviewerName, then convert the resulting Series with to_dict:

print (df)
  reviewerName                             title  reviewerRatings
0      Charles  Harry Potter Book Seven News:...              3.0
1      Charles  Harry Potter Boxed Set, Books...              5.0
2      Charles  Harry Potter and the Sorcerer...              5.0
3    Katherine  Harry Potter and the Half-Blo...              5.0
4    Katherine   Harry otter and the Order of...              5.0

d = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
       .apply(lambda x: dict(x.values))
       .to_dict())
print (d)

{
    'Charles': {
        'Harry Potter Book Seven News:...': 3.0,
        'Harry Potter Boxed Set, Books...': 5.0,
        'Harry Potter and the Sorcerer...': 5.0
    },
    'Katherine': {
        'Harry Potter and the Half-Blo...': 5.0,
        'Harry otter and the Order of...': 5.0
    }
}
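The question also asks whether the tuple-keyed dictionary returned by df.to_dict() could be reshaped after the fact. A plain-Python sketch, assuming the inner mapping has (reviewerName, title) keys exactly as in the question's output (the names flat and nested are illustrative):

```python
# Inner mapping as produced by df.to_dict()['reviewerRatings'] in the question
flat = {
    ('Charles', 'Harry Potter Book Seven News:...'): 3.0,
    ('Charles', 'Harry Potter and the Half-Blo...'): 3.5,
    ('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0,
}

nested = {}
for (name, title), rating in flat.items():
    # Create the inner dict on first sight of a reviewer, then fill it
    nested.setdefault(name, {})[title] = rating

print(nested)
```

This needs no pandas at all once the flat dictionary exists, though it builds the intermediate tuple-keyed mapping first.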
