Updating values in a Spark DataFrame

Posted 2024-09-26 18:05:08

I have 3 DataFrames:

1. Item dataframe:

+-------+---------+
|id_item|item_code|
+-------+---------+
|    991|    A0049|
|    992|    C1248|
|    993|    C0860|
|    994|    C0757|
|    995|    C0682|
+-------+---------+

and

2. User dataframe:

+------+--------+
|id_usn|     usn|
+------+--------+
|417567|39063291|
|417568|39063294|
|417569|39063334|
|417570|39063353|
|417571|39063376|
+------+--------+

and

3. Summary dataframe:

+-------+--------------------+
|id_item|             summary|
+-------+--------------------+
|    991|[[417567,0.579901...|
|    992|[[417567,0.001029...|
|    443|[[417585,0.219624...|
+-------+--------------------+

and the schema of this DataFrame:

root
 |-- id_item: integer (nullable = true)
 |-- summary: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id_usn: long (nullable = true)
 |    |    |-- rating: double (nullable = true)

Now, id_usn sits inside a StructType, and I want to replace id_usn in the Summary DataFrame with the matching usn from the User DataFrame.

I'm using Spark.

Please help me solve this problem!

1 answer
User
#1 · Posted 2024-09-26 18:05:08

Hope this helps.

 from pyspark.sql import functions as F

 # Explode the summary array: one row per (id_usn, rating) struct
 sdf1 = (summarydf
         .select('id_item', F.explode('summary').alias('col_summary'))
         .select('id_item',
                 F.col('col_summary.id_usn').alias('id_usn'),
                 F.col('col_summary.rating').alias('rating')))

 # Look up item_code and usn, then rebuild one array of structs per item
 df = (sdf1.join(itemdf, 'id_item')
           .join(userdf, 'id_usn')
           .select('item_code', F.struct('usn', 'rating').alias('tmpcol'))
           .groupby('item_code')
           .agg(F.collect_list('tmpcol').alias('summary')))
 df.show()
+---------+--------------------+
|item_code|             summary|
+---------+--------------------+
|    C1248|[[39063291,0.0010...|
|    A0049|[[39063291,0.5799...|
+---------+--------------------+
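
To make it easier to see what the explode, join and regroup steps in the answer compute, here is a rough plain-Python model of the same logic. It is not Spark code and only a sketch: the function name `replace_ids` is hypothetical, and the ratings are made-up placeholders because the question truncates the real values. It also shows why the row with id_item 443 is missing from the output, assuming that item and its user have no match in the lookup tables: both joins are inner joins, so unmatched entries are dropped.

```python
# Plain-Python model of the explode -> join -> regroup pipeline (not Spark code).
# Ratings below are placeholders; the question only shows truncated values.
items = {991: "A0049", 992: "C1248", 993: "C0860"}   # id_item -> item_code
users = {417567: 39063291, 417568: 39063294}         # id_usn -> usn
summary = {
    991: [(417567, 0.5), (417568, 0.25)],
    992: [(417567, 0.125)],
    443: [(417585, 0.75)],   # no matching item or user in the lookups above
}


def replace_ids(summary, items, users):
    """Swap id_item for item_code and id_usn for usn, dropping unmatched entries."""
    result = {}
    for id_item, pairs in summary.items():       # one "row" per item
        for id_usn, rating in pairs:             # explode the array
            # Inner-join behaviour: skip entries without a match on either side
            if id_item not in items or id_usn not in users:
                continue
            # Regroup by item_code, collecting (usn, rating) pairs
            result.setdefault(items[id_item], []).append((users[id_usn], rating))
    return result


result = replace_ids(summary, items, users)
print(result)
```

If unmatched rows should be kept instead, the Spark version would need left joins (the `how='left'` argument of `join`), at the cost of null values for item_code or usn.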
