import pandas as pd

df_state = pd.DataFrame(
    [["3 Liu Yu,876"],
     ["4 Koh chong,123"],
     ["3 Liu Yu,876"]])
# split the single text column on spaces into separate columns
df_state = df_state[0].str.split(" ", expand=True)
print(df_state, "\n")
duplicated = df_state.duplicated()  # just a boolean Series
print(duplicated, "\n")
print(df_state[duplicated], "\n")  # <- subset, then save with .to_csv
# as Anders Källmar points out, you can also flag every occurrence:
all_duplicated = df_state.duplicated(keep=False)
print(df_state[all_duplicated])
Output:
   0    1          2
0  3  Liu     Yu,876
1  4  Koh  chong,123
2  3  Liu     Yu,876

0    False
1    False
2     True
dtype: bool

   0    1       2
2  3  Liu  Yu,876

   0    1       2
0  3  Liu  Yu,876
2  3  Liu  Yu,876
# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
.str.split('\t', expand=True)
# increment index to match line number
df.index += 1
# keep duplicate entries
out = df[df[0].duplicated(keep=False)]
# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)
Contents of the output file:
15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554
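This happens because .duplicated returns a boolean Series (True/False), and you saved that Series directly.
Instead, use it to subset the data, as the first snippet above does with df_state[duplicated].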
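To make the difference concrete, here is a minimal sketch that writes both the mask and the subset to in-memory buffers so the two results can be compared side by side. The column names and sample rows are made up for illustration, and io.StringIO stands in for real files.

```python
import io
import pandas as pd

# Hypothetical sample data: row 2 is an exact copy of row 0.
df = pd.DataFrame(
    [["Liu Yu", "876"], ["Koh chong", "123"], ["Liu Yu", "876"]],
    columns=["name", "number"],
)

mask = df.duplicated()  # boolean Series, one True/False per row

# Saving the mask itself writes only the True/False flags.
mask_buf = io.StringIO()
mask.to_csv(mask_buf, header=False)

# Indexing the frame with the mask keeps the actual duplicated rows.
rows_buf = io.StringIO()
df[mask].to_csv(rows_buf, header=False)

print(mask_buf.getvalue())
print(rows_buf.getvalue())
```

The first buffer holds one flag per input row, while the second holds only the rows that were flagged, which is usually what you want in the exported file.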
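Use df.duplicated with keep=False to get a boolean mask of the duplicated rows, then extract those rows, as the second snippet above does before writing duplicated_data.csv.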
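The effect of the keep argument can be sketched on a small Series. The names below are hypothetical, and the index is shifted by one only to mimic the line numbers used in the snippet above.

```python
import pandas as pd

# Hypothetical names; index shifted to start at 1, like file line numbers.
names = pd.Series(["ANDREW", "KOH", "ANDREW", "ENOS", "KOH", "ANDREW"])
names.index += 1

first = names.duplicated()            # default keep='first': first occurrence is not flagged
last = names.duplicated(keep="last")  # last occurrence is not flagged
every = names.duplicated(keep=False)  # every member of a duplicate group is flagged

print(names[first].index.tolist())
print(names[last].index.tolist())
print(names[every].index.tolist())
```

With keep=False, every line that belongs to a repeated name is selected, which matches the output file above, where all occurrences of each repeated name are listed.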
A one-line version is also possible.
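The original one-liner is not preserved here, so the following is only a sketch of what a chained version might look like. It assumes the same input layout as the snippet above (one quoted field per line, with name and number separated by a tab), uses made-up sample rows, and reads from and writes to io.StringIO buffers instead of names_dup2.csv and duplicated_data.csv.

```python
import io
import pandas as pd

# Hypothetical input: each line is one quoted field, "name<TAB>number".
raw = io.StringIO('"ANDREW\t111"\n"KOH\t222"\n"ANDREW\t111"\n')
out = io.StringIO()

# Read, split, renumber, filter, and export in a single chained expression.
(pd.read_csv(raw, quoting=1, header=None)[0]
   .str.split("\t", expand=True)
   .rename(index=lambda i: i + 1)                # index matches line numbers
   .loc[lambda d: d[0].duplicated(keep=False)]   # keep every repeated name
   .to_csv(out, header=False))

print(out.getvalue())
```

Using rename and a callable inside .loc is one way to avoid intermediate variables; the multi-step version above is arguably easier to read and debug.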