<p>This should do it:</p>
<pre><code>from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

# Build a pandas DataFrame whose 'd' column holds lists of tuples.
df_dict = {
    'a': {
        "1": "stuff", "2": "stuff2"
    },
    "d": {
        "1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]
    }
}
df = pd.DataFrame(df_dict)

# Convert to a Spark DataFrame; each tuple is parsed as a struct
# with default field names _1 and _2.
ddf = spark.createDataFrame(df)

# explode() produces one output row per element of the 'd' array.
exploded = ddf.withColumn('d', F.explode("d"))
exploded.show()
</code></pre>
<p>Result:</p>
<pre><code>+------+------+
|     a|     d|
+------+------+
| stuff|[1, 2]|
| stuff|[3, 4]|
|stuff2|[1, 2]|
|stuff2|[3, 4]|
+------+------+
</code></pre>
<p>I find it more comfortable to do this with SQL:</p>
<pre><code>exploded.createOrReplaceTempView("exploded")
spark.sql("SELECT a, d._1 as value_1, d._2 as value_2 FROM exploded").show()
</code></pre>
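<p>If you prefer to stay in the DataFrame API, the same projection can be written without the temp view; a sketch, equivalent to the SQL above:</p>
<pre><code># Select the struct fields directly via dotted column names.
exploded.select(
    F.col("a"),
    F.col("d._1").alias("value_1"),
    F.col("d._2").alias("value_2"),
).show()
</code></pre>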
<p>Important note: the <code>_1</code> and <code>_2</code> accessors are used because Spark parses a tuple as a struct and assigns it default field names. If, in your actual implementation, the dataframe contains <code>array<int></code>, you should use the <code>[0]</code> syntax instead.</p>
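<p>To illustrate that note, here is a minimal sketch of the array case; the <code>arr_df</code> and <code>arr</code> names are made up for this example:</p>
<pre><code># Hypothetical DataFrame where 'd' is array&lt;bigint&gt; rather than a
# struct, so elements are accessed by index instead of _1/_2.
arr_df = spark.createDataFrame([("stuff", [1, 2])], ["a", "d"])
arr_df.createOrReplaceTempView("arr")
spark.sql("SELECT a, d[0] AS value_1, d[1] AS value_2 FROM arr").show()
</code></pre>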
<p>The final result of the query above:</p>
<pre><code>+ + -+ -+
| a|value_1|value_2|
+ + -+ -+
| stuff| 1| 2|
| stuff| 3| 4|
|stuff2| 1| 2|
|stuff2| 3| 4|
+ + -+ -+
</code></pre>