如何使用哈希md5方法在pyspark数据帧中查找重复项？

col_list = ['a', 'b', 'c'] df1 = df1.withColumn("hash", md5(concat_ws('+', *col_list))) df2 = df2.withColumn("hash", md5(concat_ws('+', *col_list))) upd: df1 upd: df2 |a |b |c |hash | |a |b |c |hash | |5 |2 |3 |sfsd23| |5 |2 |3 |sfsd23| |1 |5 |4 |fsd345| |5 |2 |3 |sfsd23| |1 |5 |3 |54sgsr| |1 |5 |4 |fsd345| |1 |5 |3 |54sgsr| df_diff = df1.select(df1.hash).substract(df2.select(df2.hash))

1条回答

网友

1楼 · 发布于 2024-09-30 14:25:15

使用.exceptAll（来自Spark-2.4+而不是.substract，因为.exceptAll通过使用df2作为源数据帧来保留所有重复的行

From docs:

subtract:

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

This is equivalent to EXCEPT DISTINCT in SQL.

exceptAll:

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.

This is equivalent to EXCEPT ALL in SQL.

Example:

From Spark-2.4+:

df2.exceptAll(df1).show(10,False)
#+ -+ -+ -+                +
#|a  |b  |c  |hash                            |
#+ -+ -+ -+                +
#|5  |2  |3  |747d9c66398e89fbda6570f6bf945ed6|
#+ -+ -+ -+                +

For Spark<2.4:

我们需要在上使用Sparkwindow row_number functi，然后使用left join查找重复记录

Example:

from pyspark.sql.functions import row_number
from pyspark.sql import Window

w=Window.partitionBy(*['a', 'b', 'c', 'hash']).orderBy(lit(1))


df1=df1.withColumn("rn",row_number().over(w))
df2=df2.withColumn("rn",row_number().over(w))

df2.alias("d2").join(df1.alias("d1"),(df1["a"]==df2["a"]) & (df1["b"]==df2["b"]) & (df1["c"]==df2["c"]) & (df1["hash"]==df2["hash"]) & (df1["rn"]==df2["rn"]),'left').\
filter(col("d1.rn").isNull()).\
select("d2.*").\
drop("rn").\
show()

#+ -+ -+ -+                +
#|a  |b  |c  |hash                            |
#+ -+ -+ -+                +
#|5  |2  |3  |747d9c66398e89fbda6570f6bf945ed6|
#+ -+ -+ -+                +

df_diff.show（）--无

相关问题更多 >

编程相关推荐

热门问题

热门文章