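<p>There are a few ways to do this. You can use <code>groupby</code> together with <code>to_dict</code>, or iterate over the rows with a <code>collections.defaultdict</code>. Notably, the latter is not necessarily less efficient.</p>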
<h3><a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html" rel="nofollow noreferrer"><code>groupby</code></a> + <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_dict.html" rel="nofollow noreferrer"><code>to_dict</code></a></h3>
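<p>Construct a series from each <code>groupby</code> object and convert it to a dictionary, which gives a series of dictionary values. Finally, convert that series to a dictionary with another <code>to_dict</code> call.</p>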
<pre><code>res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()
</code></pre>
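<p>To make the shape of the result concrete, here is a minimal sketch on a tiny dataframe. The column names follow the ones used in this answer; the reviewer names, titles, and ratings are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df = pd.DataFrame({
    'reviewerName': ['Ann', 'Ann', 'Bob'],
    'title': ['Book_1', 'Book_2', 'Book_1'],
    'reviewerRatings': [5, 3, 4],
})

# One inner dict per reviewer, then one outer dict keyed by reviewer.
res = df.groupby('reviewerName')\
        .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
        .to_dict()

print(res)
# {'Ann': {'Book_1': 5, 'Book_2': 3}, 'Bob': {'Book_1': 4}}
```

<p>Depending on your Pandas version, <code>groupby(...).apply</code> may emit a deprecation warning about grouping columns here. The result is unaffected, because the lambda only reads <code>title</code> and <code>reviewerRatings</code>.</p>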
<h3><a href="https://docs.python.org/3/library/collections.html#collections.defaultdict" rel="nofollow noreferrer"><code>collections.defaultdict</code></a></h3>
<p>Define a <code>defaultdict</code> of <code>dict</code> objects and iterate over the dataframe row by row.</p>
<pre><code>from collections import defaultdict

res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings
</code></pre>
<p>The resulting <code>defaultdict</code> does not need to be converted back to a regular <code>dict</code>, since <code>defaultdict</code> is a subclass of <code>dict</code>.</p>
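<p>As a quick check of the subclass relationship, the sketch below builds the result on a small made-up dataframe and confirms it passes an <code>isinstance</code> test. It also shows an optional shallow conversion with <code>dict(res)</code>, which may be worth doing if you would rather have a missing reviewer raise <code>KeyError</code> than silently create an empty entry. The sample data is hypothetical.</p>

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
df = pd.DataFrame({
    'reviewerName': ['Ann', 'Ann', 'Bob'],
    'title': ['Book_1', 'Book_2', 'Book_1'],
    'reviewerRatings': [5, 3, 4],
})

res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings

# A defaultdict can be used anywhere a dict is expected.
print(isinstance(res, dict))  # True

# Optional: shallow-copy into a plain dict so that looking up an
# unknown reviewer raises KeyError instead of adding an empty entry.
plain = dict(res)
print(type(plain) is dict)  # True
```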
<h3>Performance benchmarking</h3>
<p>Benchmark results depend on both the set-up and the data. You should test with your own data to see what works best.</p>
<pre><code># Python 3.6.5, Pandas 0.19.2
from collections import defaultdict
from random import sample

import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)
df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # double brackets: selecting several columns with a bare tuple
    # is not supported in newer Pandas versions
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

%timeit jez(df) # 33.5 ms per loop
%timeit jpp1(df) # 17 ms per loop
%timeit jpp2(df) # 21.1 ms per loop
</code></pre>
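<p><code>%timeit</code> is an IPython magic and is not available in a plain Python script. If you want to repeat the comparison outside IPython, a rough sketch using the standard-library <code>timeit</code> module could look like the following. The function names <code>with_groupby</code> and <code>with_defaultdict</code>, the smaller dataframe, and the repeat counts are choices made for this illustration, not part of the benchmark above, and the timings you get will differ from the figures quoted there.</p>

```python
import timeit
from collections import defaultdict

import numpy as np
import pandas as pd

# Small hypothetical dataframe; swap in your own data for a real comparison.
rng = np.random.RandomState(0)
n = 1000
df = pd.DataFrame({
    'reviewerName': rng.choice(['Charles', 'Lora', 'Katherine'], n),
    'title': [f'Book_{i}' for i in range(n)],
    'reviewerRatings': rng.randint(0, 6, n),
})

def with_groupby(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def with_defaultdict(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

# Sanity check: both approaches should build the same mapping.
assert with_groupby(df) == with_defaultdict(df)

for func in (with_groupby, with_defaultdict):
    # Best of 3 runs, each averaging over 10 calls.
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1000:.2f} ms per call')
```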