Removing near-duplicates in Python

Posted 2024-05-02 11:08:11


I have a dataset with millions of rows, and I want to eliminate rows that are close to each other, where the difference is below a threshold x.

Here is a sample of the dataset:

1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2

As you can see, rows 1-3 and rows 4-6 are identical except for the last column. So, with a threshold of 1, I would like the result to be two rows (one for rows 1-3 and one for rows 4-6). With a threshold of 2, all rows are identical in every column except the last two, so the result should be a single row. Honestly, it does not matter which row is kept (the first, the last, a random one).
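To make the threshold concrete, here is a minimal sketch (an illustration only, assuming plain NumPy and the six sample rows above; the variable names are made up) that counts, for every pair of rows, how many columns differ:

import numpy as np

# the six sample rows from above
rows = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
])

# pairwise count of differing columns (an unnormalised Hamming distance)
diffs = (rows[:, None, :] != rows[None, :, :]).sum(axis=2)
print(diffs)
# rows 0-2 differ from each other in at most one column (the last),
# and so do rows 3-5, which is the grouping described above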

I tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it only returns one row:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

#load dataframe
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7', 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])

def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
        dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)

# run the dedupe and inspect the result
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)

print(df_deduped)

Running the code on the dataset above returns:

   File_1  File_2  File_3  File_4  ...  File_10  File_11  File_12  File_13
0       1       1       1       1  ...        1        1        1        1

[1 rows x 13 columns]

Process finished with exit code 0
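For what it's worth, a quick check (continuing the sketch above and reusing its rows array) shows that every sample row after the first has some earlier row within distance 1, which matches the function's rule of dropping a row as soon as any earlier row is within the threshold, and hence the single surviving row:

# smallest distance from each row to any earlier row
# (reuses the `rows` array from the sketch above)
for idx in range(1, len(rows)):
    min_dist = (rows[idx] != rows[:idx]).sum(axis=1).min()
    print(idx, min_dist)
# every printed distance is 1, so with threshold=1 rows 1-5 are all
# flagged as duplicates and only row 0 is kept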

I would be very grateful if someone could help me with this.

Edit:

After trying the new code, it still does not remove all of the nearly duplicate observations. For example, on the dataset below only one of the seven rows should be kept, but it currently keeps 5 of the 7 (a quick distance check is sketched after these rows):

1 1 3 3 3 2 1 2 1 1 2 2 1 
1 3 3 3 3 2 1 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 2 2  
1 1 3 3 3 2 3 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 1 1  
1 1 3 3 3 2 1 2 1 1 2 3 1  
1 1 3 3 3 2 1 2 2 1 2 2 1
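A minimal check of that claim (same style of sketch as above; the array below just restates the seven rows):

import numpy as np

# the seven rows from the edit
rows7 = np.array([
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 3, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 2],
    [1, 1, 3, 3, 3, 2, 3, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 1, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 3, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 2, 2, 1],
])

# distance of every row to the first one
print((rows7 != rows7[0]).sum(axis=1))
# prints [0 1 1 1 1 1 1]: each of the other six rows differs from the
# first row in exactly one column, so with threshold 1 a single row suffices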

1 Answer

User · #1 · Posted 2024-05-02 11:08:11

One way, using drop_duplicates with itertools.combinations:

from itertools import combinations

n = 1
colsets = [c for c in combinations(df.columns, len(df.columns) - n)]
min((df.drop_duplicates(subset=c) for c in colsets), key=len)

Output:

   0   1   2   3   4   5   6   7   8   9   10  11  12
0   1   1   1   1   1   1   1   1   1   1   1   1   1
3   1   1   1   1   1   1   1   1   1   1   1   3   1
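As a side note on the design (a sketch, not part of the original answer): the snippet tries every subset of len(df.columns) - n columns, drops exact duplicates with respect to each subset, and keeps the smallest result. It can therefore only collapse rows whose differences all fall inside the same n excluded columns. On the seven rows from the edit, the six near-duplicates differ from the first row in five different columns, so the best any single 12-column subset can do is merge three rows, leaving five (this reuses the rows7 array from the sketch above):

import pandas as pd
from itertools import combinations

df7 = pd.DataFrame(rows7)  # rows7 as defined in the earlier sketch

n = 1
colsets = list(combinations(df7.columns, len(df7.columns) - n))
best = min((df7.drop_duplicates(subset=c) for c in colsets), key=len)
print(len(best))
# 5: rows 0, 4 and 5 agree on every column except column 11, so excluding
# column 11 merges them; no single excluded column covers the other differences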
