<p>Your question can be split into at least three parts:</p>
<ul>
<li>How to group and pivot a table?</li>
<li>How to merge tables?</li>
<li>What is <code>loc</code> doing?</li>
</ul>
<h2>General remarks</h2>
<p>Pandas provides fast vectorized implementations of many operations, so try the library versions before reaching for a loop (see below).</p>
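<p>As a toy illustration of that point (the data here is invented, not from your question): summing a column with a Python-level loop and with the built-in vectorized method gives the same result, but the library call runs in compiled code and is typically much faster.</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Python-level loop: iterrows builds a Series per row, which is slow
total_loop = 0
for _, row in df.iterrows():
    total_loop += row["x"]

# vectorized library call: the summation runs in compiled code
total_vec = df["x"].sum()

print(total_loop, total_vec)  # both 499500
</code></pre>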
<h2>Pivoting</h2>
<p><strong>1.</strong> With plain pandas:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "the"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
"you will need a dummy variable:"
df["counts"] = 1
"you probably need to reset the index"
df_counts = df.groupby(["det","word"]).agg("count").reset_index()
# det word counts
#0 a cat 2
#1 a pet 3
#2 the pet 1
"and pivot it"
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
</code></pre>
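<p>If I am not mistaken, the dummy column and the pivot can also be collapsed into a single <code>pd.crosstab</code> call, which tabulates the co-occurrence counts directly:</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"det": ["a", "the", "a", "a", "a", "a"],
                   "word": ["cat", "pet", "pet", "cat", "pet", "pet"]})
# crosstab counts row/column co-occurrences, no dummy column needed
table = pd.crosstab(df["word"], df["det"])
print(table)
</code></pre>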
<p><strong>2.</strong> Using <code>Counter</code>:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
acounter = Counter( (tuple(x) for x in df.as_matrix()) )
#Counter({('a', 'cat'): 2, ('a', 'pet'): 2, ('the', 'pet'): 2})
df_counts = pd.DataFrame(list(zip([y[0] for y in acounter.keys()], [y[1] for y in acounter.keys()], acounter.values())), columns=["det", "word", "counts"])
# det word counts
#0 a cat 2
#1 the pet 1
#2 a pet 3
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det a the
#word
#cat 2 0
#pet 3 1
</code></pre>
<p>In my case this was slightly faster than pure <code>pandas</code> (52.6 µs vs 92.9 µs per loop for the grouping; not counting the pivoting).</p>
<p><strong>3.</strong> As far as I understand it, this is a natural language processing problem. You could try joining all the data into one string per row and using <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction" rel="nofollow"><code>CountVectorizer</code></a> from <code>sklearn</code> with <code>ngram_range=(1, 2)</code>. Something like:</p>
<pre><code>df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
from sklearn.feature_extraction.text import CountVectorizer
listofpairs = []
for _, row in df.iterrows():
listofpairs.append(" ".join(row))
countvect = CountVectorizer(ngram_range=(2,2), min_df = 0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)
print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)
counter_keys = [x[1:] for x in sorted([ tuple([v] + k.split(" ")) for k,v in countvect.vocabulary_.items()])]
counter_values = np.sum(sparse_counts.toarray(), 0)
df_counts = pd.DataFrame([(x[0], x[1], y) for x,y in zip(counter_keys, counter_values)], columns=["det", "word", "counts"])
</code></pre>
<h2>Merging</h2>
<p>Two options:</p>
<p><strong>1.</strong> <code>concat</code></p>
<pre><code>df1 = df1.set_index("word")
df2 = df2.set_index("word")
dfout = pd.concat([df1, df2], axis=1)
</code></pre>
<p><strong>2.</strong> <code>merge</code></p>
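<p>A minimal sketch of the <code>merge</code> route, using two made-up frames that share a <code>word</code> column (the frame and column names here are assumptions, not from your data):</p>
<pre><code>import pandas as pd

# hypothetical per-word counts coming from two different sources
df1 = pd.DataFrame({"word": ["cat", "pet"], "counts_a": [2, 3]})
df2 = pd.DataFrame({"word": ["pet", "dog"], "counts_b": [1, 4]})

# an outer join keeps words that appear in only one of the frames
merged = df1.merge(df2, on="word", how="outer")
print(merged)
</code></pre>
<p>With <code>how="inner"</code> only words present in both frames would survive; missing counts show up as <code>NaN</code>.</p>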
<h2><code>loc</code></h2>
<p>It indexes rows when given one argument, or <code>row, column</code> when given two. It accepts row/column labels or boolean indexing (as in your case for the rows).</p>
<p>If there is only one article per gender, you can use a direct comparison instead of the <code>in</code> operation, which might speed things up. Compare:</p>
<pre><code>df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
</code></pre>
<p>with</p>
<pre><code>indices_neutral = df["precedingWord"]=="de"
df.loc[indices_neutral, "gender"] = "neuter"
</code></pre>
<p>or, shorter but less readable:</p>
<pre><code>df.loc[df["precedingWord"]=="de", "gender"] = "neuter"
</code></pre>
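<p>Putting that last form to work on a small made-up frame (column names borrowed from your snippet, the data itself invented):</p>
<pre><code>import pandas as pd

df = pd.DataFrame({"precedingWord": ["de", "het", "de"]})
df["gender"] = "unknown"

# the boolean mask selects the rows, "gender" selects the column to assign
df.loc[df["precedingWord"] == "de", "gender"] = "neuter"
print(df)  # rows 0 and 2 become "neuter", row 1 stays "unknown"
</code></pre>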