Substring of each element of an array column in PySpark 2.2

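2 answers

User

Answer 1 · edited 2024-09-30 06:23:55

Your udf approach works for me. Alternatively, you can use transform together with substring (transform is a SQL higher-order function available from Spark 2.4 onward):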

import pyspark.sql.functions as f

# Spark SQL substring positions are 1-based (a start of 0 is treated like 1)
df.withColumn('new_column', f.expr('transform(col1, x -> substring(x, 1, 5))')).show()

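+--------------------+--------------------+
|                col1|          new_column|
+--------------------+--------------------+
|[hello-123, abcde...|      [hello, abcde]|
|[hello-234, abcde...|[hello, abcde, xy...|
|[hiiii-111, abbbb...|[hiiii, abbbb, xy...|
+--------------------+--------------------+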
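Since the question is about Spark 2.2, where transform is not available, here is a minimal sketch of what a udf version could look like. The original udf from the question is not shown in this thread, so the names truncate_each and with_truncated_column are hypothetical, and the column names col1 and new_column are taken from the output above.

```python
def truncate_each(values, length=5):
    """Return the first `length` characters of every string in `values`."""
    if values is None:
        return None
    # Keep null elements as null instead of failing on them
    return [v[:length] if v is not None else None for v in values]


def with_truncated_column(df, src="col1", dst="new_column"):
    """Hypothetical helper: add `dst` holding the truncated elements of `src`."""
    # Imported here so truncate_each can be tried out without Spark installed
    from pyspark.sql import functions as f
    from pyspark.sql.types import ArrayType, StringType

    truncate_udf = f.udf(truncate_each, ArrayType(StringType()))
    return df.withColumn(dst, truncate_udf(f.col(src)))
```

Calling with_truncated_column(df).show() should then give a new_column like the one above. A Python udf is usually slower than a built-in expression, so transform is probably the better choice once you are on Spark 2.4 or later.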

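User

Answer 2 · edited 2024-09-30 06:23:55

I solved this with a different approach: explode the array, take the substring of each element, then collect the results back into an array.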

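import pyspark.sql.functions as F

# idx tags each original row so its elements can be regrouped after explode.
# collect_list keeps duplicate substrings (collect_set would drop them);
# element order is not guaranteed after the groupBy.
df1\
   .withColumn('idx', F.monotonically_increasing_id())\
   .withColumn('exploded_col', F.explode(F.col('col1')))\
   .withColumn('substr_col', F.substring(F.col('exploded_col'), 1, 5))\
   .groupBy(F.col('idx'))\
   .agg(F.collect_list('substr_col').alias('new_column'))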
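If the element order matters, one possible variant of the explode approach above is to carry each element's position along with posexplode and sort on it before dropping it again. This is only a sketch under assumptions: the function name substring_each_ordered and the intermediate column names pos, elem, sub and pairs are made up for illustration, and col1 / new_column are reused from the answers above.

```python
def substring_each_ordered(df, src="col1", dst="new_column", length=5):
    """Hypothetical helper: explode with positions, substring, regroup in order."""
    from pyspark.sql import functions as F

    # Tag every input row so its elements can be regrouped later
    indexed = df.withColumn("idx", F.monotonically_increasing_id())
    # posexplode yields one row per element together with its array position
    exploded = indexed.select(
        "idx", F.posexplode(F.col(src)).alias("pos", "elem")
    )
    grouped = (
        exploded
        .withColumn("sub", F.substring(F.col("elem"), 1, length))
        .groupBy("idx")
        # Sorting (pos, sub) structs orders the elements by original position
        .agg(F.sort_array(F.collect_list(F.struct("pos", "sub"))).alias("pairs"))
    )
    # Selecting a struct field from an array of structs gives an array of that field
    return grouped.select("idx", F.col("pairs.sub").alias(dst))
```

Like the code above, this drops rows whose array is null or empty, and the result only contains idx and the new column, so you would need to join it back on idx to keep the other columns.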
