<p>Your question can be split into at least three parts:</p>
<ul>
<li>How to group and pivot a table?</li>
<li>How to merge tables?</li>
<li>What is <code>loc</code> doing?</li>
</ul>
<h2>General remarks</h2>
<p>Pandas provides fast vectorized implementations of many operations, so try the library versions before reaching for a loop (see below).</p>
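<p>As a toy illustration of that point (the data here is invented, not from your question): summing a column with a Python-level loop and with the built-in vectorized method gives the same result, but the library call runs in compiled code and is typically much faster.</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Python-level loop: iterrows builds a Series per row, which is slow
total_loop = 0
for _, row in df.iterrows():
    total_loop += row["x"]

# vectorized library call: the summation runs in compiled code
total_vec = df["x"].sum()

print(total_loop, total_vec)  # both 499500
</code></pre>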
<h2>Pivoting</h2>
<p><strong>1.</strong> With plain pandas:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "the"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
"you will need a dummy variable:"
df["counts"] = 1
"you probably need to reset the index"
df_counts = df.groupby(["det","word"]).agg("count").reset_index()
# det word counts
#0 a cat 2
#1 a pet 3
#2 the pet 1
"and pivot it"
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
</code></pre>
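<p>If I am not mistaken, the dummy column and the pivot can also be collapsed into a single <code>pd.crosstab</code> call, which tabulates the co-occurrence counts directly:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})
# crosstab counts row/column co-occurrences, no dummy column needed
table = pd.crosstab(df["word"], df["det"])
print(table)
</code></pre>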
<p><strong>2.</strong> Using <code>Counter</code>:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
acounter = Counter( (tuple(x) for x in df.as_matrix()) )
#Counter({('a', 'cat'): 2, ('a', 'pet'): 2, ('the', 'pet'): 2})
df_counts = pd.DataFrame(list(zip([y[0] for y in acounter.keys()], [y[1] for y in acounter.keys()], acounter.values())), columns=["det", "word", "counts"])
# det word counts
#0 a cat 2
#1 the pet 1
#2 a pet 3
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
</code></pre>
<p>In my case this was slightly faster than pure <code>pandas</code> (52.6 µs vs 92.9 µs per loop for the grouping; not counting the pivoting).</p>
<p><strong>3.</strong> As far as I understand it, this is a natural language processing problem. You could try joining all the data into one string per row and using <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction" rel="nofollow"><code>CountVectorizer</code></a> from <code>sklearn</code> with <code>ngram_range=(1, 2)</code>. Something like:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
from sklearn.feature_extraction.text import CountVectorizer
listofpairs = []
for _, row in df.iterrows():
listofpairs.append(" ".join(row))
countvect = CountVectorizer(ngram_range=(2,2), min_df = 0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)
print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)
counter_keys = [x[1:] for x in sorted([ tuple([v] + k.split(" ")) for k,v in countvect.vocabulary_.items()])]
counter_values = np.sum(sparse_counts.toarray(), 0)
df_counts = pd.DataFrame([(x[0], x[1], y) for x,y in zip(counter_keys, counter_values)], columns=["det", "word", "counts"])
</code></pre>
<h2>Merging</h2>
<p>Two options:</p>
<p><strong>1.</strong> <code>concat</code></p>
<pre><code>df1 = df1.set_index("word")
df2 = df2.set_index("word")
dfout = pd.concat([df1, df2], axis=1)
</code></pre>
<p><strong>2.</strong> <code>merge</code></p>
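<p>A minimal sketch of the <code>merge</code> route, using two made-up frames that share a <code>word</code> column (the frame and column names here are assumptions, not from your data):</p>
<pre><code>import pandas as pd

# hypothetical per-word counts coming from two different sources
df1 = pd.DataFrame({"word": ["cat", "pet"], "counts_a": [2, 3]})
df2 = pd.DataFrame({"word": ["pet", "dog"], "counts_b": [1, 4]})

# an outer join keeps words that appear in only one of the frames
merged = df1.merge(df2, on="word", how="outer")
print(merged)
</code></pre>
<p>With <code>how="inner"</code> only words present in both frames would survive; missing counts show up as <code>NaN</code>.</p>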
<h2><code>loc</code></h2>
<p>It indexes rows when given one argument, or <code>row, column</code> when given two. It accepts row/column labels or boolean indexing (as in your case for the rows).</p>
<p>If there is only one article per gender, you can use a direct comparison instead of the <code>in</code> operation, which might speed things up. Compare:</p>
<pre><code>df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
</code></pre>
<p>with</p>
<pre><code>indices_neutral = df["precedingWord"]=="de"
df.loc[indices_neutral, "gender"] = "neuter"
</code></pre>
<p>or, shorter but less readable:</p>
<pre><code>df.loc[df["precedingWord"]=="de", "gender"] = "neuter"
</code></pre>
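<p>Putting that last form to work on a small made-up frame (column names borrowed from your snippet, the data itself invented):</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"precedingWord": ["de", "het", "de"]})
df["gender"] = "unknown"

# the boolean mask selects the rows, "gender" selects the column to assign
df.loc[df["precedingWord"] == "de", "gender"] = "neuter"
print(df)  # rows 0 and 2 become "neuter", row 1 stays "unknown"
</code></pre>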