How to find duplicate rows in a CSV with Python and write them to a new CSV

Posted 2024-06-25 05:54:23


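I can currently find the duplicates, but the result does not show the line number, name, and number, and the output is incorrect (see the expected output below).

Here is the sample file (link edited): https://wetransfer.com/downloads/c7213abe1a80677bbadc6ddb8faceaf920211021094523/09ad80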

Expected output (edited):

[Image: new expected output]

Current result:

[Image: current result]


2 answers

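This happens because .duplicated returns a boolean Series (True/False), and you are saving that Series directly.

Instead, use it as a mask to subset the data, like this: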

import pandas as pd


df_state = pd.DataFrame(
                [["3 Liu Yu,876"],
                ["4 Koh chong,123"],
                ["3 Liu Yu,876"]])

df_state = df_state[0].str.split(" ", expand= True)
print(df_state, "\n")

duplicated = df_state.duplicated() # just a boolean series
print(duplicated, "\n")

print(df_state[duplicated], "\n")  ## <- subset and save with .to_csv

# as Anders Källmar points out, you can also do this:

all_duplicated = df_state.duplicated(keep= False)
print(df_state[all_duplicated])


Output:

   0    1          2
0  3  Liu     Yu,876
1  4  Koh  chong,123
2  3  Liu     Yu,876 

0    False
1    False
2     True
dtype: bool 

   0    1       2
2  3  Liu  Yu,876 

   0    1       2
0  3  Liu  Yu,876
2  3  Liu  Yu,876
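
The answer above mentions saving the subset with .to_csv but does not show that step. A minimal sketch follows, assuming the rows are already split into three columns; the column names `id`, `name`, and `number` are hypothetical, and an in-memory buffer stands in for a real output path.

```python
import io
import pandas as pd

# Hypothetical sample data: the same three rows as above, already split into columns
df_state = pd.DataFrame(
    [["3", "Liu Yu", "876"],
     ["4", "Koh chong", "123"],
     ["3", "Liu Yu", "876"]],
    columns=["id", "name", "number"],  # assumed column names
)

# keep=False marks every member of a duplicate group, not only the later copies
mask = df_state.duplicated(keep=False)

# Write only the duplicated rows; StringIO stands in for a file path such as "dups.csv"
buffer = io.StringIO()
df_state[mask].to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name instead of the buffer writes the same content to disk. With the default `keep='first'`, only the later copies would be written.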

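Use duplicated(keep=False) to get a boolean mask of the duplicate rows, then extract those rows:

import pandas as pd

# read each line as a single quoted field (quoting=1 is csv.QUOTE_ALL),
# then split name / number on the tab character
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)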

# increment index to match line number
df.index += 1

# keep duplicate entries
out = df[df[0].duplicated(keep=False)]

# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)

Contents of the output file:

15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554
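
Note that the code above checks only column 0 (the name) for duplicates. If two different people could share a name, it may be safer to compare the number as well. The sketch below illustrates the difference on a hypothetical in-memory stand-in for names_dup2.csv; the file layout (one quoted, tab-separated field per line), the column names, and the extra row with a different number are all assumptions.

```python
import io
import pandas as pd

# Hypothetical stand-in for names_dup2.csv: one quoted, tab-separated field per line
lines = [
    '"ANDREW ZHAO CHONG\t83091746"',
    '"KOH KANG RI\t89943392"',
    '"ANDREW ZHAO CHONG\t83091746"',
    '"ANDREW ZHAO CHONG\t11111111"',  # same name, different number (made up)
]
raw = "\n".join(lines)

df = pd.read_csv(io.StringIO(raw), quoting=1, header=None)[0] \
       .str.split('\t', expand=True)
df.columns = ["name", "number"]  # assumed column names
df.index += 1                    # match 1-based line numbers

# Duplicates judged by name only (equivalent to df[0].duplicated(keep=False) above)
by_name = df[df.duplicated(subset=["name"], keep=False)]

# Duplicates judged by name and number together
by_both = df[df.duplicated(keep=False)]

print(by_name)
print(by_both)
```

Here `by_name` keeps lines 1, 3, and 4, while `by_both` keeps only lines 1 and 3, because line 4 carries a different number.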

One-liner version:

pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
  .str.split('\t', expand=True) \
  .assign(index=lambda x: x.index+1) \
  .set_index('index') \
  [lambda x: x[0].duplicated(keep=False)] \
  .to_csv('duplicated_data.csv', header=False)
