<p>Update: this is a <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html" rel="nofollow"><code>crosstab</code></a>:</p>
<pre><code>In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])
In [12]: df1
Out[12]:
precedingWord  a  few  some  the
node
banana         2    0     0    1
coconut        0    1     1    1
In [13]: df2 = pd.crosstab(df['node'], df['comp'])
</code></pre>
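<p><em>For reference, here is a sample DataFrame consistent with the outputs above. The exact rows are an assumption reconstructed from the counts (only the marginal totals are determined by the output), so treat it as a sketch:</em></p>

```python
import pandas as pd

# Hypothetical sample data: the per-row pairing of precedingWord and comp
# is an assumption, but the counts match the crosstab outputs shown above.
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})

# One frequency table per column, indexed by node
df1 = pd.crosstab(df["node"], df["precedingWord"])
df2 = pd.crosstab(df["node"], df["comp"])
```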
<p>This is clearly cleaner (and a more efficient algorithm for larger data).</p>
<p>Then glue them together using concat with axis=1 (i.e. adding more columns, rather than more rows).</p>
<p>I would probably keep it like that (as a MultiIndex); if you want it flat, just don't pass the keys (although there could be a problem with duplicate words):</p>
<pre><code>In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
           a  few  some  the  lal  lel  lil
node
banana     2    0     0    1    1    2    0
coconut    0    1     1    1    1    1    1
</code></pre>
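<p><em>To keep track of which original column each count came from, you can pass keys to get a MultiIndex on the columns. A sketch (the key names here are my own choice):</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
df1 = pd.crosstab(df["node"], df["precedingWord"])
df2 = pd.crosstab(df["node"], df["comp"])

# keys become the outer level of the column MultiIndex, so a value like
# "the" in precedingWord can never collide with a comp column of the same name
res = pd.concat([df1, df2], axis=1, keys=["precedingWord", "comp"])
```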
<p><em>Aside: it would be nicer if concat didn't require explicitly passing the column names (as the keys kwarg) when they already exist...</em></p>
<hr/>
<h3>Original answer</h3>
<p>You can use <code>value_counts</code>:</p>
<pre><code>In [21]: g = df.groupby("node")
In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel    2
         lal    1
coconut  lal    1
         lel    1
         lil    1
dtype: int64
In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a        2
         the      1
coconut  few      1
         some     1
         the      1
dtype: int64
</code></pre>
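<p><em>In newer pandas versions you can get a single group's counts as a frame directly, since unstack accepts a fill_value (an assumption: this requires pandas 0.18 or later), which avoids the separate fillna(0) step used below:</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
g = df.groupby("node")

# value_counts gives a Series with a (node, comp) MultiIndex;
# unstack moves the inner level to columns, filling gaps with 0
counts = g["comp"].value_counts().unstack(fill_value=0)
```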
<p>Getting this into a single frame is a little trickier:</p>
<pre><code>In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
           a  few  lal  lel  lil  some  the
node
banana   NaN  NaN    1    2  NaN   NaN  NaN
coconut  NaN  NaN    1    1    1   NaN  NaN
banana     2  NaN  NaN  NaN  NaN   NaN    1
coconut  NaN    1  NaN  NaN  NaN     1    1
In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
           a  few  lal  lel  lil  some  the
node
banana     0    0    1    2    0     0    0
coconut    0    0    1    1    1     0    0
banana     2    0    0    0    0     0    1
coconut    0    1    0    0    0     1    1
</code></pre>
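<p><em>Note the result above stacks rows, so each node appears twice. If you want one row per node you can concat along axis=1 instead; this is a variant sketch, not from the original answer:</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
g = df.groupby("node")

# axis=1 puts the two count tables side by side, giving one row per node
res = pd.concat(
    [g["comp"].value_counts().unstack(),
     g["precedingWord"].value_counts().unstack()],
    axis=1,
).fillna(0)
```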
<hr/>
<p><em>You could map the columns to det1, det2, etc. before doing the concat, e.g. if you have the mapping as a dict:</em></p>
<pre><code>In [31]: res = g["comp"].value_counts().unstack(1)
In [32]: res
Out[32]:
comp     lal  lel  lil
node
banana     1    2  NaN
coconut    1    1    1
In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)
In [34]: res
Out[34]:
         det1  det2  det3
node
banana      1     2   NaN
coconut     1     1     1
</code></pre>
<p>Alternatively, you can use a list comprehension (if you don't have a dict or don't care about the specific labels):</p>
<pre><code>In [41]: res = g["comp"].value_counts().unstack(1)
In [42]: res.columns = ['det%s' % i for i, _ in enumerate(res.columns)]
</code></pre>