<p>You need to use <a href="http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.when" rel="nofollow noreferrer"><code>when</code></a> to get the concatenation right. Other than that, the way you were using the <code>outer</code> join was almost correct.</p>
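<p>You need to check whether each of the two columns <a href="http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column.isNull" rel="nofollow noreferrer"><code>isNull</code></a> or <a href="http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.Column.isNotNull" rel="nofollow noreferrer"><code>isNotNull</code></a>, and only then apply <a href="http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat" rel="nofollow noreferrer"><code>concat</code></a>. The check matters because <code>concat</code> returns null as soon as any of its inputs is null, so the rows that exist on only one side of the join would otherwise lose their value.</p>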
<pre><code>from pyspark.sql.functions import col, when, concat
df1 = sqlContext.createDataFrame([('a','foo'),('b','bar'),('c','egg'),('d','fog')],['id','some_string'])
df2 = sqlContext.createDataFrame([('a','hoi'),('b','hei'),('c','hai'),('e','hui')],['id','some_string'])
df_outer_join=df1.join(df2.withColumnRenamed('some_string','some_string_x'), ['id'], how='outer')
df_outer_join.show()
+---+-----------+-------------+
| id|some_string|some_string_x|
+---+-----------+-------------+
|  e|       null|          hui|
|  d|        fog|         null|
|  c|        egg|          hai|
|  b|        bar|          hei|
|  a|        foo|          hoi|
+---+-----------+-------------+
# Concatenate when both sides are present; otherwise keep whichever side is non-null.
df_outer_join = df_outer_join.withColumn('some_string_concat',
    when(col('some_string').isNotNull() & col('some_string_x').isNotNull(), concat(col('some_string'), col('some_string_x')))
    .when(col('some_string').isNull() & col('some_string_x').isNotNull(), col('some_string_x'))
    .when(col('some_string').isNotNull() & col('some_string_x').isNull(), col('some_string')))\
    .drop('some_string','some_string_x')
df_outer_join.show()
+---+------------------+
| id|some_string_concat|
+---+------------------+
|  e|               hui|
|  d|               fog|
|  c|            egghai|
|  b|            barhei|
|  a|            foohoi|
+---+------------------+
</code></pre>
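<p>If you want to sanity-check the expected result without a Spark session, the three-branch rule above can be modelled in plain Python. This is only a rough sketch, not Spark code: <code>None</code> stands in for Spark's null, and the helper names <code>merge_strings</code> and <code>outer_join_concat</code> are made up for illustration. It also assumes each <code>id</code> appears at most once per input.</p>

```python
# Plain-Python model of the full outer join plus the three-branch rule.
# No Spark involved; None stands in for Spark's null.
# merge_strings and outer_join_concat are hypothetical helper names.

def merge_strings(left, right):
    """Concatenate when both sides exist, else return whichever side exists."""
    if left is not None and right is not None:
        return left + right
    if left is None:
        return right  # may itself be None if both sides are missing
    return left


def outer_join_concat(rows1, rows2):
    """Full outer join on id; rows are (id, some_string) pairs with unique ids."""
    d1, d2 = dict(rows1), dict(rows2)
    all_ids = sorted(d1.keys() | d2.keys())
    return {key: merge_strings(d1.get(key), d2.get(key)) for key in all_ids}


df1_rows = [('a', 'foo'), ('b', 'bar'), ('c', 'egg'), ('d', 'fog')]
df2_rows = [('a', 'hoi'), ('b', 'hei'), ('c', 'hai'), ('e', 'hui')]

result = outer_join_concat(df1_rows, df2_rows)
print(result)
```

<p>The ids <code>d</code> and <code>e</code> exist on only one side, so they keep their single value, which matches the last table above.</p>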