PySpark regular expression join

1 answer

User

#1 · posted 2024-09-25 08:39:27

regexp_replace can be used for this:

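from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([
    ("bob1", "a.b.*.d"), ("bob2", "a.b.c")], ["col1", "col2"])
df2 = spark.createDataFrame([
    ("tom1", "a.b.c.d"), ("tom2", "a.b.c")], ["col3", "col4"])
# Turn each wildcard in col2 into an anchored regex: "*" becomes "(\w+)".
df1 = df1.withColumn("join_col", F.concat(F.lit("^"), F.regexp_replace(F.col("col2"), "\\*", "(\\\\w+)"), F.lit("$")))
# Join on the regex match: keep each (df1, df2) pair where col4 matches join_col.
df_joined = df1.join(df2, F.expr("col4 rlike join_col"))
df_joined.show()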

Output:

+----+-------+-------------+----+-------+
|col1|   col2|     join_col|col3|   col4|
+----+-------+-------------+----+-------+
|bob1|a.b.*.d|^a.b.(\w+).d$|tom1|a.b.c.d|
|bob2|  a.b.c|      ^a.b.c$|tom2|  a.b.c|
+----+-------+-------------+----+-------+

The parentheses around \w+ can be omitted.
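One caveat with the generated join_col: the dots taken from col2 are left unescaped, so each "." in the regex matches any character rather than only a literal dot. Below is a minimal sketch in plain Python (no Spark) of a stricter conversion. It assumes that a segment consisting only of "*" stands for one run of word characters and that everything else should match literally; the helper name wildcard_to_regex is hypothetical and not part of the answer.

```python
import re


def wildcard_to_regex(pattern: str) -> str:
    """Turn a dotted wildcard such as 'a.b.*.d' into an anchored regex.

    Hypothetical helper: each segment is escaped literally, and a segment
    that is exactly '*' becomes \\w+ (one or more word characters).
    """
    parts = []
    for segment in pattern.split("."):
        parts.append(r"\w+" if segment == "*" else re.escape(segment))
    # Join the segments with an escaped dot so "." only matches a real dot.
    return "^" + r"\.".join(parts) + "$"


loose = re.compile(r"^a.b.(\w+).d$")            # the join_col produced above
strict = re.compile(wildcard_to_regex("a.b.*.d"))

print(wildcard_to_regex("a.b.*.d"))
print(bool(loose.search("a.b.c.d")), bool(strict.search("a.b.c.d")))
print(bool(loose.search("axbxcxd")), bool(strict.search("axbxcxd")))
```

Both patterns accept "a.b.c.d", but only the loose one also accepts "axbxcxd". In the Spark version, a comparable effect would probably need an extra regexp_replace that escapes the dots before the asterisk is replaced.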

Unfortunately, df_joined.explain() shows that the rlike join results in a CartesianProduct:

== Physical Plan ==
CartesianProduct col4#5 RLIKE join_col#26
:- *(1) Project [col1#0, col2#1, concat(^, regexp_replace(col2#1, \*, (\\w+)), $) AS join_col#26]
:  +- *(1) Filter isnotnull(concat(^, regexp_replace(col2#1, \*, (\\w+)), $))
:     +- *(1) Scan ExistingRDD[col1#0,col2#1]
+- *(2) Filter isnotnull(col4#5)
   +- *(2) Scan ExistingRDD[col3#4,col4#5]
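
The CartesianProduct appears because the join condition is an arbitrary predicate rather than an equality, so there is no key to hash or sort on and every pair of rows has to be tested. The following is a plain Python model of that behaviour, not Spark code; the function regex_join and the row tuples are illustrative assumptions that mirror the example data.

```python
import re


def regex_join(left_rows, right_rows):
    """Model of a non-equi join: test every left/right pair.

    left_rows:  (col1, col2, join_col) tuples, where join_col is a regex
    right_rows: (col3, col4) tuples
    Returns the joined rows and the number of pairs that were compared.
    """
    joined, comparisons = [], 0
    for col1, col2, join_col in left_rows:
        pattern = re.compile(join_col)
        for col3, col4 in right_rows:
            comparisons += 1
            if pattern.search(col4):  # same idea as "col4 rlike join_col"
                joined.append((col1, col2, join_col, col3, col4))
    return joined, comparisons


left = [("bob1", "a.b.*.d", r"^a.b.(\w+).d$"), ("bob2", "a.b.c", r"^a.b.c$")]
right = [("tom1", "a.b.c.d"), ("tom2", "a.b.c")]

rows, comparisons = regex_join(left, right)
for row in rows:
    print(row)
print("pairs compared:", comparisons)
```

With n rows on one side and m on the other, this amounts to n × m regex evaluations. If one side is small, wrapping it in pyspark.sql.functions.broadcast typically lets Spark plan a BroadcastNestedLoopJoin instead, which still compares all pairs but avoids shuffling the larger side.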
