How to use PySpark to check whether the sentences in a string column contain one or more words

1 answer

User

#1 · Posted 2024-10-03 06:32:00

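Here is a workable solution: rather than iterating over every item, use the built-in array function array_contains(). To make that possible the data needs one small adjustment first: the string column has to be converted to an array.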

Create the DataFrame:

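from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()  # reuse the active session if one exists
# Sample data: an id column and a sentence column
df = spark.createDataFrame(
    [(1, "This is a Horse"), (2, "Monkey Loves trees"), (3, "House has a tree"), (4, "The Ocean is Cold")],
    ["col1", "col2"],
)
df.show(truncate=False)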

Output:

+----+------------------+
|col1|col2              |
+----+------------------+
|1   |This is a Horse   |
|2   |Monkey Loves trees|
|3   |House has a tree  |
|4   |The Ocean is Cold |
+----+------------------+

The logic: use split() to convert the string column to ArrayType, then test the array with array_contains().

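# Split each sentence on spaces so col2 becomes array<string>
df = df.withColumn("col2", F.split("col2", " "))
# True when the array holds the exact token "This" or "tree" (case-sensitive; "trees" does not match)
df = df.withColumn("array_filter", F.array_contains("col2", "This") | F.array_contains("col2", "tree"))
# Keep only the matching rows
df = df.filter(F.col("array_filter"))
df.show(truncate=False)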

Output:

+----+---------------------+------------+
|col1|col2                 |array_filter|
+----+---------------------+------------+
|1   |[This, is, a, Horse] |true        |
|3   |[House, has, a, tree]|true        |
+----+---------------------+------------+
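
Row 2 ("Monkey Loves trees") is missing from the result because array_contains() compares whole array elements, so the token "trees" is not equal to "tree". The following is a minimal plain-Python sketch of that same split-then-exact-match logic, not Spark code. The helper name contains_any_word and the wanted list are hypothetical, introduced only for illustration.

```python
# Plain-Python sketch of the split + exact-token-match logic (no Spark needed).
# `contains_any_word` and `wanted` are hypothetical names used for illustration.
rows = [
    (1, "This is a Horse"),
    (2, "Monkey Loves trees"),
    (3, "House has a tree"),
    (4, "The Ocean is Cold"),
]


def contains_any_word(sentence, words):
    """Return True if any of `words` equals a whole space-separated token."""
    tokens = sentence.split(" ")  # same idea as F.split("col2", " ")
    return any(word in tokens for word in words)  # exact, case-sensitive match


wanted = ["This", "tree"]
matched_ids = [row_id for row_id, sentence in rows if contains_any_word(sentence, wanted)]
print(matched_ids)  # [1, 3]
```

If partial matches such as "trees" should also count, an exact token comparison like this is not enough and a substring or pattern match on the original string would be needed instead.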
