Removing near-duplicates in Python

Posted 2024-05-02 11:08:11


I have a dataset with millions of rows, and I want to eliminate rows that are close to each other, where the difference is below a threshold x.

Here is a sample of the dataset:

1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2

As you can see, rows 1-3 and rows 4-6 are identical except for the last column. So, with a threshold of 1, I would like the result to be two rows (one for rows 1-3 and one for rows 4-6). With a threshold of 2, all rows are identical in every column except the last two, so the result should be a single row. Honestly, it does not matter which row is kept (the first, the last, a random one).
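To make the threshold concrete, here is a minimal sketch (an illustration only, assuming plain NumPy and the six sample rows above; the variable names are made up) that counts, for every pair of rows, how many columns differ:

import numpy as np

# the six sample rows from above
rows = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
])

# pairwise count of differing columns (an unnormalised Hamming distance)
diffs = (rows[:, None, :] != rows[None, :, :]).sum(axis=2)
print(diffs)
# rows 0-2 differ from each other in at most one column (the last),
# and so do rows 3-5, which is the grouping described above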

I tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it only returns one row:

from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

#load dataframe
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7', 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])

def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
        dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)

# run the dedupe and inspect the result
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)

print(df_deduped)

Running the code on the dataset above returns:

   File_1  File_2  File_3  File_4  ...  File_10  File_11  File_12  File_13
0       1       1       1       1  ...        1        1        1        1

[1 rows x 13 columns]

Process finished with exit code 0
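For what it's worth, a quick check (continuing the sketch above and reusing its rows array) shows that every sample row after the first has some earlier row within distance 1, which matches the function's rule of dropping a row as soon as any earlier row is within the threshold, and hence the single surviving row:

# smallest distance from each row to any earlier row
# (reuses the `rows` array from the sketch above)
for idx in range(1, len(rows)):
    min_dist = (rows[idx] != rows[:idx]).sum(axis=1).min()
    print(idx, min_dist)
# every printed distance is 1, so with threshold=1 rows 1-5 are all
# flagged as duplicates and only row 0 is kept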

I would be very grateful if someone could help me with this.

Edit:

After trying the new code, it still does not remove all of the nearly duplicate observations. For example, on the dataset below only one of the seven rows should be kept, but it currently keeps 5 of the 7 (a quick distance check is sketched after these rows):

1 1 3 3 3 2 1 2 1 1 2 2 1 
1 3 3 3 3 2 1 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 2 2  
1 1 3 3 3 2 3 2 1 1 2 2 1  
1 1 3 3 3 2 1 2 1 1 2 1 1  
1 1 3 3 3 2 1 2 1 1 2 3 1  
1 1 3 3 3 2 1 2 2 1 2 2 1
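A minimal check of that claim (same style of sketch as above; the array below just restates the seven rows):

import numpy as np

# the seven rows from the edit
rows7 = np.array([
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 3, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 2],
    [1, 1, 3, 3, 3, 2, 3, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 1, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 3, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 2, 2, 1],
])

# distance of every row to the first one
print((rows7 != rows7[0]).sum(axis=1))
# prints [0 1 1 1 1 1 1]: each of the other six rows differs from the
# first row in exactly one column, so with threshold 1 a single row suffices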

1 Answer

User · #1 · Posted 2024-05-02 11:08:11

One way, using drop_duplicates with itertools.combinations:

from itertools import combinations

n = 1
colsets = [c for c in combinations(df.columns, len(df.columns) - n)]
min((df.drop_duplicates(subset=c) for c in colsets), key=len)

Output:

   0   1   2   3   4   5   6   7   8   9   10  11  12
0   1   1   1   1   1   1   1   1   1   1   1   1   1
3   1   1   1   1   1   1   1   1   1   1   1   3   1
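As a side note on the design (a sketch, not part of the original answer): the snippet tries every subset of len(df.columns) - n columns, drops exact duplicates with respect to each subset, and keeps the smallest result. It can therefore only collapse rows whose differences all fall inside the same n excluded columns. On the seven rows from the edit, the six near-duplicates differ from the first row in five different columns, so the best any single 12-column subset can do is merge three rows, leaving five (this reuses the rows7 array from the sketch above):

import pandas as pd
from itertools import combinations

df7 = pd.DataFrame(rows7)  # rows7 as defined in the earlier sketch

n = 1
colsets = list(combinations(df7.columns, len(df7.columns) - n))
best = min((df7.drop_duplicates(subset=c) for c in colsets), key=len)
print(len(best))
# 5: rows 0, 4 and 5 agree on every column except column 11, so excluding
# column 11 merges them; no single excluded column covers the other differences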
