<p>Take a look at the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html" rel="nofollow noreferrer">duplicated()</a> method. From the documentation:</p>
<blockquote>
<p>Return boolean Series denoting duplicate rows.</p>
</blockquote>
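<p>The optional <code>subset</code> parameter is especially useful here, because it restricts the comparison to the columns you name. Depending on your exact goal, you can do several things with <code>duplicated()</code>:</p>
<p>To flag every row that is part of a duplicate group (<code>keep=False</code> marks all occurrences, including the first), use</p>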
<pre><code>duplicates = test.duplicated(subset = ['PROFILE URL', 'FINAL Addres'], keep = False)
</code></pre>
<p>To select the repeated entries for each duplicated user (every occurrence after the first), use</p>
<pre><code>duplicate_users = test[test.duplicated(subset = ['PROFILE URL', 'FINAL Addres'], keep = 'first')]
</code></pre>
<p>To return a DataFrame with the duplicates removed (each previously duplicated row is kept exactly once):</p>
<pre><code>duplicates = test.duplicated(subset = ['PROFILE URL', 'FINAL Addres'])
duplicate_free_df = test.loc[~duplicates]
</code></pre>
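<p>To see how the three variants differ, here is a minimal sketch on made-up data. Only the two column names come from the snippets above; the sample rows, the <code>NOTE</code> column, and the names <code>cols</code>, <code>all_dupes</code> and <code>later_dupes</code> are assumptions for illustration.</p>

```python
import pandas as pd

# Hypothetical sample data; only the two key column names come from the answer.
test = pd.DataFrame({
    "PROFILE URL": ["u/a", "u/b", "u/a", "u/c", "u/a"],
    "FINAL Addres": ["1 Main St", "2 Oak Ave", "1 Main St", "3 Elm Rd", "1 Main St"],
    "NOTE": ["first", "only", "second", "only", "third"],
})
cols = ["PROFILE URL", "FINAL Addres"]

# keep=False: mark every member of a duplicate group, including the first one.
all_dupes = test.duplicated(subset=cols, keep=False)

# keep='first' (the default): leave the first occurrence unmarked,
# mark only the later repeats.
later_dupes = test.duplicated(subset=cols, keep="first")

# Rows that repeat an earlier row.
duplicate_users = test[later_dupes]

# Drop the later repeats so each (URL, address) pair appears exactly once.
duplicate_free_df = test.loc[~later_dupes]

print(all_dupes.tolist())
print(later_dupes.tolist())
print(duplicate_free_df)
```

<p>If all you need is the de-duplicated frame, <code>test.drop_duplicates(subset=cols)</code> should give the same result as the last step in one call.</p>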