I have a dataset with several million rows, and I want to eliminate rows that are close to each other, where the difference is within a threshold x.
Here is a sample of the dataset:
1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 3
1 1 1 1 1 1 1 1 1 1 1 1 2
1 1 1 1 1 1 1 1 1 1 1 3 1
1 1 1 1 1 1 1 1 1 1 1 3 3
1 1 1 1 1 1 1 1 1 1 1 3 2
As you can see, rows 1-3 and rows 4-6 are each identical except for the last column. So, given a threshold of 1, I would like the result to be two rows (one from rows 1-3 and one from rows 4-6). With a threshold of 2, all rows are identical in every column except the last two, so the result should be just one row. Honestly, it does not matter which row is kept (the first, the last, a random one).
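To be precise, by "difference" I mean the number of columns in which two rows disagree (the Hamming distance). A minimal illustration with numpy, comparing rows 1 and 5 of the sample:

import numpy as np

row_1 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
row_5 = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3])

# count the columns in which the two rows disagree
print(np.sum(row_1 != row_5))  # 2 -> near duplicates for threshold 2, but not for threshold 1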
I tried the following code from another thread (Removing *NEARLY* Duplicate Observations - Python), but it only returns one row:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np
#load dataframe
data = pd.read_csv(r'testfile.csv')
df = pd.DataFrame(data, columns=['File_1', 'File_2', 'File_3', 'File_4', 'File_5', 'File_6', 'File_7', 'File_8', 'File_9', 'File_10', 'File_11', 'File_12', 'File_13'])
def dedupe_partially_vectorized(df, threshold=1):
    """
    - Iterate through each row starting from the last; examine all previous rows for duplicates.
    - If found, it is appended to a list of duplicate indices.
    """
    # convert field data to integers
    enc = OrdinalEncoder()
    X = enc.fit_transform(df.to_numpy())

    """
    - loop through each row, starting from last
    - for each `row`, calculate hamming distance to all previous rows
    - if any such distance is `threshold` or less, mark `idx` as duplicate
    - loop ends at 2nd row (1st is by definition not a duplicate)
    """
    dupe_idx = []
    for j in range(len(X) - 1):
        idx = len(X) - j - 1
        row = X[idx]
        prev_rows = X[0:idx]
        dists = np.sum(row != prev_rows, axis=1)
        if min(dists) <= threshold:
            dupe_idx.append(idx)
    dupe_idx = sorted(dupe_idx)
    df_dupes = df.iloc[dupe_idx]
    df_deduped = df.drop(dupe_idx)
    return (df_deduped, df_dupes)
# run the dedupe and print the deduplicated rows
(df_deduped, df_dupes) = dedupe_partially_vectorized(df)
print(df_deduped)
Running the code on the dataset above returns:
File_1 File_2 File_3 File_4 ... File_10 File_11 File_12 File_13
0 1 1 1 1 ... 1 1 1 1
[1 rows x 13 columns]
Process finished with exit code 0
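To understand why, I recomputed each row's minimum Hamming distance to all earlier rows (the same check the loop performs), hard-coding the six sample rows:

import numpy as np

# the six sample rows from above
X = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2],
])

for idx in range(1, len(X)):
    dists = np.sum(X[idx] != X[:idx], axis=1)  # Hamming distance to every earlier row
    print(idx, dists.min())  # prints a minimum distance of 1 for every row

Every row after the first is within distance 1 of some earlier row (for example, row 4 differs from row 1 only in column 12), so the function marks all of them as duplicates and only the first row survives. That seems to be what the code is designed to do, but it is not the grouping I want.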
I would really appreciate it if someone could help me solve this.
EDIT:
After trying the new code that was suggested (an approach using drop_duplicates with itertools.combinations), it still does not remove all of the nearly duplicate observations. For example, with the dataset below only one of the seven rows should be kept, but it currently keeps 5 of 7:
1 1 3 3 3 2 1 2 1 1 2 2 1
1 3 3 3 3 2 1 2 1 1 2 2 1
1 1 3 3 3 2 1 2 1 1 2 2 2
1 1 3 3 3 2 3 2 1 1 2 2 1
1 1 3 3 3 2 1 2 1 1 2 1 1
1 1 3 3 3 2 1 2 1 1 2 3 1
1 1 3 3 3 2 1 2 2 1 2 2 1
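For reference, here is a slow but straightforward sketch of the behavior I expect (my own naive baseline, not the suggested code): greedily keep a row only if it differs from every already-kept row in more than threshold columns.

import numpy as np
import pandas as pd

def dedupe_greedy(df, threshold=1):
    """Keep a row only if it differs from every already-kept row in more than `threshold` columns."""
    X = df.to_numpy()
    kept = []  # positional indices of the representative rows
    for i in range(len(X)):
        # Hamming distance from row i to each representative kept so far
        if all(np.sum(X[i] != X[k]) > threshold for k in kept):
            kept.append(i)
    return df.iloc[kept]

rows = [
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 3, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 2, 2],
    [1, 1, 3, 3, 3, 2, 3, 2, 1, 1, 2, 2, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 1, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 1, 1, 2, 3, 1],
    [1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 2, 2, 1],
]
print(dedupe_greedy(pd.DataFrame(rows), threshold=1))  # keeps only the first row

This gives exactly the results I described: two rows for threshold 1 (and one row for threshold 2) on the six-row sample, and a single row on the seven rows above. However, it compares every row against every kept row, which is far too slow for millions of rows, so I am looking for a faster way to get this behavior.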