<p>在<code>pandas</code>中,通常有一种比用<code>for</code>循环遍历数据帧更好的方法</p>
<p>在这种情况下,您真正想要的是将相等的tweet分组在一起,只保留第一条tweet。这可以通过<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html" rel="nofollow noreferrer">^{<cd3>}</a>实现:</p>
<pre><code>import random
import string
import pandas as pd
# some random one character tweets, so there are many duplicates
df = pd.DataFrame({"Tweets": random.choices(string.ascii_lowercase, k=100),
"Data": [random.random() for _ in range(100)]})
df.groupby("Tweets", as_index=False).first()
# Tweets Data
# 0 a 0.327766
# 1 b 0.677697
# 2 c 0.517186
# 3 d 0.925312
# 4 e 0.748902
# 5 f 0.353826
# 6 g 0.991566
# 7 h 0.761849
# 8 i 0.488769
# 9 j 0.501704
# 10 k 0.737816
# 11 l 0.428117
# 12 m 0.650945
# 13 n 0.530866
# 14 o 0.337835
# 15 p 0.567097
# 16 q 0.130282
# 17 r 0.619664
# 18 s 0.365220
# 19 t 0.005407
# 20 u 0.905659
# 21 v 0.495603
# 22 w 0.511894
# 23 x 0.094989
# 24 y 0.089003
# 25 z 0.511532
</code></pre>
<p>更妙的是,甚至有一个显式函数<a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html" rel="nofollow noreferrer">^{<cd4>}</a>,它的速度大约是它的两倍:</p>
<pre><code>df.drop_duplicates(subset="Tweets", keep="first")
</code></pre>