Extracting duplicate and missing values from DataFrame Series

Posted 2024-10-06 07:10:35


I'm new to pandas.

I have two data sources, A and B.

A and B each have a single column, with data like this:

A

Cj0KCQiAiZPvBRDZARIsAORkq7fOa9HW8u6iqLm1KvTjAhWTrYoLeL_baPPO5WoiLHsHeVYUmFFxXa0aAvxKEALw_wcB
EAIaIQobChMImLDtsuSY5gIVR3RgCh1ckQ1fEAAYASAAEgJ4nvD_BwE
Cj0KCQiAiZPvBRDZARIsAORkq7fOa9HW8u6iqLm1KvTjAhWTrYoLeL_baPPO5WoiLHsHeVYUmFFxXa0aAvxKEALw_wcB
Cj0KCQiAiZPvBRDZARIsAORkq7enWHEermCPb4NKdGwnh2HQwUPftxai7nufoVPOgDHE8CE9_s0hSAIaArPJEALw_wcB
Cj0KCQiAiZPvBRDZARIsAORkq7fQm2PgqtRHrXGkzcBPsZo-1Rwm4Ln6RuSBLumtNeElnoASiyC49HAaAoTWEALw_wcB

B

EAIaIQobChMI_tf0seSY5gIViKztCh1TbAAhEAAYASAAEgKcg_D_BwE
EAIaIQobChMImpyb_-OY5gIVET5gCh38Kw3bEAAYBCAAEgLmHfD_BwE
Cj0KCQiAiZPvBRDZARIsAORkq7fnlXGP7pfobqU5VFzlMPdPSjCKzSE6n43QSnkbQ264SVnX9kkSyHAaApudEALw_wcB
EAIaIQobChMIwvGQt-SY5gIVh6ztCh1c0gHQEAAYAyAAEgLqvPD_BwE
Cj0KCQiAiZPvBRDZARIsAORkq7ej_kXsK5XGwISOQTWUZoChlugerRH0Wcz4Wrpn1qJzlIkKxwqljCsaAhRNEALw_wcB

I concatenated the two frames into a single column like this:

joined = pd.concat([A,B])

That gives me one column containing both sources. Next I create a new DataFrame, storing joined in the first column and B in the second:

final_export = pd.DataFrame()
final_export['A'] = joined
final_export['B'] = B

The DataFrame looks like this:

final_export

A                                                         B
EAIaIQobChMI_tf0seSY5gIViKztCh1TbAAhEAAYASAAEgKcg_D_BwE   EAIaIQobChMI_tf0seSY5gIViKztCh1TbAAhEAAYASAAEgKcg_D_BwE 
EAIaIQobChMImpyb_-OY5gIVET5gCh38Kw3bEAAYBCAAEgLmHfD_BwE   EAIaIQobChMI_tf0seSY5gIViKztCh1TbAAhEAAYASAAEgKcg_D_BwE
EAIaIQobChMIwvGQt-SY5gIVh6ztCh1c0gHQEAAYAyAAEgLqvPD_BwE
EAIaIQobChMI_tf0seSY5gIViKztCh1TbAAhEAAYASAAEgKcg_D_BwE
EAIaIQobChMImpyb_-OY5gIVET5gCh38Kw3bEAAYBCAAEgLmHfD_BwE
EAIaIQobChMIwvGQt-SY5gIVh6ztCh1c0gHQEAAYAyAAEgLqvPD_BwE
...

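Column A has more entries than column B.

I then created a new DataFrame with three columns: 'In both', 'Only in A', and 'Only in B'. The idea is that I have one list of all values and need to check which of them occur in both sources; values that occur in only one source go into the 'Only in A' or 'Only in B' column: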

df_export = pd.DataFrame({'In both': pd.Series(np.intersect1d(final_export['A'], final_export['B'])),
                          'Only in A': pd.Series(np.setdiff1d(final_export['A'], final_export['B'])),
                          'Only in B': pd.Series(np.setdiff1d(final_export['B'], final_export['A']))})

But I get an error:

TypeError: '<' not supported between instances of 'float' and 'str'

I tried calling .fillna('') on column B, since it has fewer entries than column A, but I still get the same error.

Thanks for any advice.
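
A likely cause, assuming the shorter column was padded with NaN: pandas stores a missing entry as a float NaN, and np.intersect1d and np.setdiff1d sort their inputs, so a string ID ends up being compared with a float. The sketch below reproduces the same comparison failure in plain Python and shows that dropping the missing entries first avoids it. The ID values are hypothetical placeholders, not data from the question.

```python
import math

# Hypothetical stand-in for a column that mixes string IDs with a missing value.
# pandas represents a missing entry as a float NaN.
column = ["EAIa_example_1", float("nan"), "Cj0K_example_2"]

try:
    sorted(column)  # sorting has to compare a str with a float
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Dropping the missing entries first leaves only strings, which sort fine.
cleaned = [v for v in column if not (isinstance(v, float) and math.isnan(v))]
sorted_cleaned = sorted(cleaned)
print(sorted_cleaned)
```

On the pandas side, the equivalent step would be calling .dropna() on each column before passing it to NumPy. Note also that .fillna('') returns a new object, so it has no effect unless the result is assigned back.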


1 Answer
User
#1 · Posted 2024-10-06 07:10:35

This is better done in pure Python first, creating the DataFrame afterwards, like so:

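# Assumes A and B are pandas Series (or lists) of IDs.
# Iterating a DataFrame yields its column labels, not its values.
setA = set(A)
setB = set(B)
both = setA & setB          # values present in both sources
only_in_A = setA - both     # values present only in A
only_in_B = setB - both     # values present only in B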

Then create the DataFrame, for example:

# One row per distinct value, with boolean flags for where it occurs
tuples = [(item, item in both, item in only_in_A, item in only_in_B)
          for item in setA | setB]
df = pd.DataFrame(tuples, columns=['value', 'in_both', 'only_in_A', 'only_in_B'])
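
If the three-column layout from the question is preferred over boolean flags, the same sets can be arranged that way. This is a minimal sketch, not the answerer's code: the short IDs are hypothetical, None stands in for a missing entry, and the columns are padded to equal length with itertools.zip_longest.

```python
from itertools import zip_longest

# Hypothetical short IDs standing in for the long click IDs in the question.
A = ["id1", "id2", "id1", "id3"]
B = ["id2", "id4", None]  # None stands in for a missing entry

# Drop missing entries, then deduplicate with sets as in the answer.
setA = {v for v in A if v is not None}
setB = {v for v in B if v is not None}

columns = {
    "In both": sorted(setA & setB),
    "Only in A": sorted(setA - setB),
    "Only in B": sorted(setB - setA),
}

# Pad the shorter columns with None so every row has three cells.
rows = list(zip_longest(*columns.values()))
for row in rows:
    print(row)
```

The resulting rows can then be handed to pandas with pd.DataFrame(rows, columns=list(columns)). With real pandas data the missing entries would be NaN rather than None, so dropping them with .dropna() before building the sets would be the matching step.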
