<p>Update: this is a <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html" rel="nofollow"><code>crosstab</code></a>:</p>
<pre><code>In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])
In [12]: df1
Out[12]:
precedingWord  a  few  some  the
node
banana         2    0     0    1
coconut        0    1     1    1
In [13]: df2 = pd.crosstab(df['node'], df['comp'])
</code></pre>
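<p><em>For reference, here is a sample DataFrame consistent with the outputs above. The exact rows are an assumption reconstructed from the counts (only the marginal totals are determined by the output), so treat it as a sketch:</em></p>

```python
import pandas as pd

# Hypothetical sample data: the per-row pairing of precedingWord and comp
# is an assumption, but the counts match the crosstab outputs shown above.
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})

# One frequency table per column, indexed by node
df1 = pd.crosstab(df["node"], df["precedingWord"])
df2 = pd.crosstab(df["node"], df["comp"])
```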
<p>This is clearly cleaner (and a more efficient algorithm for larger data).</p>
<p>Then glue them together using concat with axis=1 (i.e. adding more columns, rather than more rows).</p>
<p>I would probably keep it like that (as a MultiIndex); if you want it flat, just don't pass the keys (although there could be a problem with duplicate words):</p>
<pre><code>In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
           a  few  some  the  lal  lel  lil
node
banana     2    0     0    1    1    2    0
coconut    0    1     1    1    1    1    1
</code></pre>
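<p><em>To keep track of which original column each count came from, you can pass keys to get a MultiIndex on the columns. A sketch (the key names here are my own choice):</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
df1 = pd.crosstab(df["node"], df["precedingWord"])
df2 = pd.crosstab(df["node"], df["comp"])

# keys become the outer level of the column MultiIndex, so a value like
# "the" in precedingWord can never collide with a comp column of the same name
res = pd.concat([df1, df2], axis=1, keys=["precedingWord", "comp"])
```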
<p><em>Aside: it would be nicer if concat didn't require explicitly passing the column names (as the keys kwarg) when they already exist...</em></p>
<hr/>
<h3>Original answer</h3>
<p>You can use <code>value_counts</code>:</p>
<pre><code>In [21]: g = df.groupby("node")
In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel    2
         lal    1
coconut  lal    1
         lel    1
         lil    1
dtype: int64
In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a        2
         the      1
coconut  few      1
         some     1
         the      1
dtype: int64
</code></pre>
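<p><em>In newer pandas versions you can get a single group's counts as a frame directly, since unstack accepts a fill_value (an assumption: this requires pandas 0.18 or later), which avoids the separate fillna(0) step used below:</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
g = df.groupby("node")

# value_counts gives a Series with a (node, comp) MultiIndex;
# unstack moves the inner level to columns, filling gaps with 0
counts = g["comp"].value_counts().unstack(fill_value=0)
```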
<p>Getting this into a single frame is a little trickier:</p>
<pre><code>In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
           a  few  lal  lel  lil  some  the
node
banana   NaN  NaN    1    2  NaN   NaN  NaN
coconut  NaN  NaN    1    1    1   NaN  NaN
banana     2  NaN  NaN  NaN  NaN   NaN    1
coconut  NaN    1  NaN  NaN  NaN     1    1
In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
           a  few  lal  lel  lil  some  the
node
banana     0    0    1    2    0     0    0
coconut    0    0    1    1    1     0    0
banana     2    0    0    0    0     0    1
coconut    0    1    0    0    0     1    1
</code></pre>
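<p><em>Note the result above stacks rows, so each node appears twice. If you want one row per node you can concat along axis=1 instead; this is a variant sketch, not from the original answer:</em></p>

```python
import pandas as pd

# Hypothetical sample data consistent with the outputs above
df = pd.DataFrame({
    "node": ["banana", "banana", "banana", "coconut", "coconut", "coconut"],
    "precedingWord": ["a", "a", "the", "few", "some", "the"],
    "comp": ["lel", "lel", "lal", "lal", "lel", "lil"],
})
g = df.groupby("node")

# axis=1 puts the two count tables side by side, giving one row per node
res = pd.concat(
    [g["comp"].value_counts().unstack(),
     g["precedingWord"].value_counts().unstack()],
    axis=1,
).fillna(0)
```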
<hr/>
<p><em>You could map the columns to det1, det2, etc. before doing the concat, e.g. if you have the mapping as a dict:</em></p>
<pre><code>In [31]: res = g["comp"].value_counts().unstack(1)
In [32]: res
Out[32]:
comp     lal  lel  lil
node
banana     1    2  NaN
coconut    1    1    1
In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)
In [34]: res
Out[34]:
         det1  det2  det3
node
banana      1     2   NaN
coconut     1     1     1
</code></pre>
<p>Alternatively, you can use a list comprehension (if you don't have a dict or don't care about the specific labels):</p>
<pre><code>In [41]: res = g["comp"].value_counts().unstack(1)
In [42]: res.columns = ['det%s' % i for i, _ in enumerate(res.columns)]
</code></pre>