
Posted 2024-05-02 11:08:11




1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2


I tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it returns only one row:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

#load dataframe
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7', 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])

def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    # - loop through each row, starting from last
    # - for each `row`, calculate hamming distance to all previous rows
    # - if any such distance is `threshold` or less, mark `idx` as duplicate
    # - loop ends at 2nd row (1st is by definition not a duplicate)
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    # dupe_idx holds positions, so map them to index labels before dropping
    df_deduped = df.drop(df.index[dupe_idx])
    return (df_deduped, df_dupes)

# run the deduplication and print the rows that were kept
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)
print(df_deduped)



   File_1  File_2  File_3  File_4  ...  File_10  File_11  File_12  File_13
0       1       1       1       1  ...        1        1        1        1

[1 rows x 13 columns]

Process finished with exit code 0
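
The single-row result can be reproduced with a minimal sketch, assuming testfile.csv holds the six sample rows shown at the top of the question (an assumption, since the file itself is not shown). Every row after the first lies within Hamming distance 1 of some earlier row, so with threshold=1 each of them is flagged as a duplicate and only row 0 is kept.

```python
import numpy as np

# The six sample rows from the question (assumed contents of testfile.csv).
rows = [
    [1] * 13,
    [1] * 12 + [3],
    [1] * 12 + [2],
    [1] * 11 + [3, 1],
    [1] * 11 + [3, 3],
    [1] * 11 + [3, 2],
]
X = np.array(rows)

# For each row after the first: Hamming distance to its nearest earlier row.
nearest = [int(np.sum(X[i] != X[:i], axis=1).min()) for i in range(1, len(X))]
print(nearest)  # [1, 1, 1, 1, 1]
```

Since no distance exceeds the threshold, rows 1 to 5 are all removed. Whether that is the intended behaviour depends on whether near-duplicates are meant to chain through intermediate rows.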




1 1 3 3 3 2 1 2 1 1 2 2 1 
1 3 3 3 3 2 1 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 2 2  
1 1 3 3 3 2 3 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 1 1  
1 1 3 3 3 2 1 2 1 1 2 3 1  
1 1 3 3 3 2 1 2 2 1 2 2 1

Answer #1 · Posted 2024-05-02 11:08:11


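from itertools import combinations

n = 1  # number of columns ignored when comparing rows
# every way of keeping all but n of the columns
colsets = [list(c) for c in combinations(df.columns, len(df.columns) - n)]
# deduplicate on each column subset and keep the smallest resulting frame
min((df.drop_duplicates(subset=c) for c in colsets), key=len)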


   0   1   2   3   4   5   6   7   8   9   10  11  12
0   1   1   1   1   1   1   1   1   1   1   1   1   1
3   1   1   1   1   1   1   1   1   1   1   1   3   1
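
As a self-contained sketch, the same idea can be run on the six sample rows from the question. The helper name `dedupe_ignoring_n_columns` is hypothetical and introduced only for illustration, and the DataFrame is built inline rather than read from testfile.csv.

```python
from itertools import combinations

import pandas as pd

# The six sample rows from the question, with default integer column labels.
rows = [
    [1] * 13,
    [1] * 12 + [3],
    [1] * 12 + [2],
    [1] * 11 + [3, 1],
    [1] * 11 + [3, 3],
    [1] * 11 + [3, 2],
]
df = pd.DataFrame(rows)


def dedupe_ignoring_n_columns(df, n=1):
    """Hypothetical helper: try every way of ignoring n columns, drop exact
    duplicates on the remaining columns, and return the smallest result."""
    candidates = (
        df.drop_duplicates(subset=list(cols))
        for cols in combinations(df.columns, len(df.columns) - n)
    )
    return min(candidates, key=len)


result = dedupe_ignoring_n_columns(df, n=1)
print(result.index.tolist())  # [0, 3]
```

Note that this ignores the same n columns for every row, so it is not a general pairwise near-duplicate search, and the number of column subsets to try grows quickly as n increases.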
