<p>There is a way to do this in pyspark, <strong>but it only handles 3 levels</strong>. Note that in the example below the last row has 4 levels and fails; hopefully that is not your case, but be aware of it.</p>
<pre><code>import pandas as pd
import pyspark.sql.functions as F

# create toy data
pdf = pd.DataFrame({'child': list('ABCDEFGHIJKLM'),
                    'parent': ['', 'A', 'A', 'B', 'B', 'C', 'C', 'E', '', 'I', 'J', 'K', 'L']})

# convert to a Spark dataframe
df = spark.createDataFrame(pdf)

# coalesce the parent column: a root becomes its own parent
df = df.withColumn('parent', F.when(F.col('parent') != '', F.col('parent'))
                              .otherwise(F.col('child')))

# self-join twice, using aliases to refer to the right columns
res = (
    df.alias('df1')
      .join(df.alias('df2'), F.col('df1.parent') == F.col('df2.child'))
      .join(df.alias('df3'), F.col('df2.parent') == F.col('df3.child'))
      .select('df1.child', 'df1.parent', F.col('df3.parent').alias('highest_parent'))
)
</code></pre>
<p>And this is what you get:</p>
<pre><code>res.orderBy('child').show()
+-----+------+--------------+
|child|parent|highest_parent|
+-----+------+--------------+
|    A|     A|             A|
|    B|     A|             A|
|    C|     A|             A|
|    D|     B|             A|
|    E|     B|             A|
|    F|     C|             A|
|    G|     C|             A|
|    H|     E|             A|
|    I|     I|             I|
|    J|     I|             I|
|    K|     J|             I|
|    L|     K|             I|
|    M|     L|             J|  &lt;-- this one is 4 levels deep so it fails; add another join if needed
+-----+------+--------------+
</code></pre>
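<p>For reference, the fixed-depth limitation comes from the hard-coded number of self-joins: each join climbs exactly one level. The underlying idea, walking up the parent chain until you hit a root, has no such limit when expressed as a loop. A minimal pure-Python sketch of the same logic on the same toy data (not Spark code, just to illustrate the traversal):</p>
<pre><code># same toy data as above: child -> parent ('' marks a root)
parent = {'A': '', 'B': 'A', 'C': 'A', 'D': 'B', 'E': 'B', 'F': 'C', 'G': 'C',
          'H': 'E', 'I': '', 'J': 'I', 'K': 'J', 'L': 'K', 'M': 'L'}

def highest_parent(child):
    # walk up the chain until the current node has no parent (i.e. it is a root)
    node = child
    while parent.get(node, ''):
        node = parent[node]
    return node

print(highest_parent('M'))  # 'I' -- works at any depth, unlike the 3-level join
print(highest_parent('H'))  # 'A'
</code></pre>
<p>In Spark itself you would get the same effect by repeating the self-join in a loop until the parent column stops changing.</p>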