<p>You can use the <a href="https://github.com/Lyonk71/pandas-dedupe" rel="nofollow noreferrer">pandas-dedupe</a> library to clean up typos in your dataset.<br/>
Example code:</p>
<pre><code>import pandas as pd
import pandas_dedupe

df = pd.DataFrame({'class': ['Iris-setosa', 'Iris-setossa', 'Iris-versicolor', 'Iris-virginica', 'versicolor', 'iris-setosa', 'versicolor']})

dd = pandas_dedupe.dedupe_dataframe(
    df,
    field_properties=['class'],
    sample_size=1,
    canonicalize=True
)
# At this point pandas-dedupe will ask you to label some records as distinct or duplicates.
# Once done, you hit finish ('f') and here is the output:
#              class  cluster id  confidence canonical_class
# 0      iris-setosa           0    1.000000     iris-setosa
# 1     iris-setossa           0    1.000000     iris-setosa
# 2  iris-versicolor           1    0.998748      versicolor
# 3   iris-virginica           2    1.000000  iris-virginica
# 4       versicolor           1    0.999115      versicolor
# 5      iris-setosa           0    1.000000     iris-setosa
# 6       versicolor           1    0.999115      versicolor
</code></pre>
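<p>If you have a clean list of names (i.e. a gazetteer), you can also try gazetteer deduplication, which removes duplicates by matching the messy data against the gazetteer. pandas-dedupe supports gazetteer deduplication as well.</p>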
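<p>To illustrate the idea of matching messy values against a clean list, here is a minimal sketch that uses only the standard library's <code>difflib</code>. It is not how pandas-dedupe works internally (that library trains a model from your labels); plain string similarity is swapped in as a stand-in. The <code>GAZETTEER</code> list, the <code>match_to_gazetteer</code> helper and the <code>0.6</code> cutoff are assumptions made for this example.</p>

```python
import difflib

# Hypothetical clean list (the "gazetteer") of canonical class names.
GAZETTEER = ["iris-setosa", "iris-versicolor", "iris-virginica"]


def match_to_gazetteer(value, gazetteer=GAZETTEER, cutoff=0.6):
    """Return the gazetteer entry most similar to value, or None if none reaches cutoff."""
    # Lowercase first so that 'Iris-setosa' and 'iris-setosa' compare equal.
    matches = difflib.get_close_matches(value.lower(), gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else None


messy = ['Iris-setosa', 'Iris-setossa', 'Iris-versicolor', 'Iris-virginica',
         'versicolor', 'iris-setosa', 'versicolor']

# Map every messy value to its closest canonical name.
cleaned = [match_to_gazetteer(v) for v in messy]
print(cleaned)
```

<p>With this sketch, <code>'Iris-setossa'</code> maps to <code>'iris-setosa'</code> and <code>'versicolor'</code> maps to <code>'iris-versicolor'</code>. Values with no sufficiently similar entry come back as <code>None</code>, so you can review them by hand instead of forcing a match.</p>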