Adding a column from the original DataFrame to a grouped DataFrame in PySpark

Posted 2024-06-11 12:38:15

Hello, I created a grouped DataFrame from my original DataFrame with the following command:

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"))

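My spark_df DataFrame has three columns: Transaction, Product, and CustomerID.

I want to carry the CustomerID column over into the sp2 DataFrame (it should not be used for grouping).

When I try to join it with this command: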

df_joined = sp2.join(spark_df, "CustomerID")

I get this error message:

Py4JJavaError: An error occurred while calling o44.join. : org.apache.spark.sql.AnalysisException: USING column CustomerID cannot be resolved on the left side of the join. The left-side columns: [Transaction, items];

1 answer

User

#1 · Posted 2024-06-11 12:38:15

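This error occurs because the sp2 DataFrame has no CustomerID column, so the two DataFrames cannot be joined on CustomerID. I suggest creating a CustomerID column in sp2 filled with a placeholder value and then joining it with spark_df on that column. Note that this only lets the join resolve: an inner join keeps rows whose keys are equal, so with the placeholder "None" the result is likely to be empty unless some CustomerID is literally the string "None".

Here is sample code for doing this: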

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

sp2 = sp2.withColumn('CustomerID', f.lit("None").cast(StringType()))

df_joined = sp2.join(spark_df, "CustomerID")

Update: another way to add the CustomerID column to the grouped data is to use the first function:

import pyspark.sql.functions as F

sp2 = spark_df.drop_duplicates().groupBy('Transaction').agg(F.collect_list("Product").alias("items"), F.first('CustomerID').alias('CustomerID'))
