PySpark function to concatenate the unique values of different columns

2 answers

User

Answer #1 · edited 2024-06-02 22:10:14

If you have Spark 2.4+:

from pyspark.sql import functions as F

# array_union already drops duplicates; it returns null if either input is null
df.withColumn("concat", F.array_union(F.col("p1"), F.col("p2"))).show()

For Spark 2.3 and below:

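from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

def concat_array(col1, col2):
    # Treat a null column as an empty list, then drop duplicates with a set
    return list(set((list() if col1 is None else col1) + (list() if col2 is None else col2)))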

concat_array_udf = F.udf(concat_array, ArrayType(IntegerType()))

df.withColumn('concat', concat_array_udf(df.p1, df.p2)).show()
+---+---+----+------+
| id| p1|  p2|concat|
+---+---+----+------+
|foo|[1]|null|   [1]|
|bar|[2]| [2]|   [2]|
|loo|[3]| [4]|[3, 4]|
+---+---+----+------+
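
The helper wrapped by the UDF is plain Python, so its behavior can be checked without a Spark session. The sketch below is not part of the original answer: `concat_array_sorted` is a hypothetical variant that sorts the merged values, on the assumption that a deterministic element order is wanted (a bare `list(set(...))` does not guarantee one).

```python
def concat_array_sorted(col1, col2):
    """Hypothetical variant of concat_array: merge two lists, drop duplicates, sort."""
    # Treat a missing (None) column value as an empty list, as the original does.
    merged = (col1 or []) + (col2 or [])
    return sorted(set(merged))

# The three (p1, p2) pairs from the example table above.
rows = [([1], None), ([2], [2]), ([3], [4])]
results = [concat_array_sorted(p1, p2) for p1, p2 in rows]
print(results)  # [[1], [2], [3, 4]]
```

It should be usable in place of `concat_array` when registering the UDF, with the same `ArrayType(IntegerType())` return type.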

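User

Answer #2 · edited 2024-06-02 22:10:14

Hi. To concatenate the values while keeping only the unique ones, you can use the code below. I use a lambda function to process every row of the DataFrame and define check_unique_vlaues(), which returns the unique values for the row being processed.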

def check_unique_vlaues(first, second):
    if first == second:
        return first
    else:
        return [first, second]

df['p3'] = df.apply(lambda x: check_unique_vlaues(x.p1, x.p2), axis=1)
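
One thing to watch with check_unique_vlaues() as written: if the cells already hold lists, as in the question's data, `[first, second]` produces a nested list such as `[[3], [4]]`, and a missing value is kept as an element. A minimal sketch of a flattening alternative follows. `merge_unique` is a hypothetical name, and it assumes each cell is a list, a single scalar, or None.

```python
def merge_unique(first, second):
    """Hypothetical helper: flat list of the distinct values, in first-seen order."""
    result = []
    for cell in (first, second):
        if cell is None:
            continue  # skip missing cells
        # Accept either a list of values or a single scalar value.
        values = cell if isinstance(cell, list) else [cell]
        for value in values:
            if value not in result:
                result.append(value)
    return result

print(merge_unique([1], None))  # [1]
print(merge_unique([2], [2]))   # [2]
print(merge_unique([3], [4]))   # [3, 4]
print(merge_unique(5, 5))       # [5]
```

It could be passed to `df.apply` in the same way as the function above. Note that pandas often represents missing values as NaN rather than None, which this sketch does not handle.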

Edit:

To get the unique values across all columns of a row, excluding the first column, we can use the unique() function available on a pandas Series.

def func(row):
    row = row[1:]
    return row.unique()

df['concat'] = df.apply(lambda x: func(x), axis=1)
