Pandas to PySpark: transforming a column of lists of tuples to separate columns for each tuple item



1 answer

Anonymous, 1 day ago


Update

If you start from a DataFrame with the following schema:

<pre class="lang-python prettyprint-override"><code>ddf.printSchema()
#root
# |-- a: string (nullable = true)
# |-- d: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- _1: long (nullable = true)
# |    |    |-- _2: long (nullable = true)
</code></pre>

You have to use <code>pyspark.sql.functions.explode</code> to explode the array into one row per element, but after that you can use the <code>*</code> selector to turn the struct into columns:

^{pr2}$

To rename the columns, you can use a list comprehension with <code>str.replace</code>:

<pre class="lang-python prettyprint-override"><code>from pyspark.sql.functions import col

# Rename the struct fields _1, _2 to value1, value2
row_breakdown = row_breakdown.select(
    *[col(c).alias(c.replace("_", "value")) for c in row_breakdown.columns]
)
row_breakdown.show()
#+------+------+------+
#|     a|value1|value2|
#+------+------+------+
#| stuff|     1|     2|
#| stuff|     3|     4|
#|stuff2|     1|     2|
#|stuff2|     3|     4|
#+------+------+------+
</code></pre>

<hr/>

Original answer

If you are starting from the dictionary, you do not need <code>pandas</code> for this at all. Instead, you can create the DataFrame directly from the dictionary. The key is to <a href="https://stackoverflow.com/a/51561188/5858851">transform your dictionary into the appropriate format</a>, and then use that to build the Spark DataFrame.

In your example, it seems that the values under the <code>a</code> key are not used at all.

As I <a href="https://stackoverflow.com/questions/52243200/pandas-to-pyspark-transforming-a-column-of-lists-of-tuples-to-separate-columns#comment91439756_52243200">mentioned in my comment</a>, you can achieve the described output with the following code:

<pre class="lang-python prettyprint-override"><code>from itertools import chain

df_dict = {
    'a': {
        "1": "stuff", "2": "stuff2"
    },
    "d": {
        "1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]
    }
}

# Flatten the lists of tuples into a single stream of rows
row_breakdown = spark.createDataFrame(
    chain.from_iterable(df_dict["d"].values()), ["value1", "value2"]
)
row_breakdown.show()
#+------+------+
#|value1|value2|
#+------+------+
#|     1|     2|
#|     3|     4|
#|     1|     2|
#|     3|     4|
#+------+------+
</code></pre>

If you want an index-like column, you can get one simply by using <code>enumerate</code>, as in the following example. Here I am also sorting the values by key, as that seems to be your intention.

<pre class="lang-python prettyprint-override"><code># Prepend a running index to every tuple, visiting the keys in sorted order.
# (Python 3 does not allow tuple parameters in a lambda, so index into the pair.)
data = (
    (i,) + v for i, v in enumerate(
        chain.from_iterable(
            v for k, v in sorted(df_dict["d"].items(), key=lambda kv: kv[0])
        )
    )
)
columns = ["index", "value1", "value2"]
row_breakdown = spark.createDataFrame(data, columns)
row_breakdown.show()
#+-----+------+------+
#|index|value1|value2|
#+-----+------+------+
#|    0|     1|     2|
#|    1|     3|     4|
#|    2|     1|     2|
#|    3|     3|     4|
#+-----+------+------+
</code></pre>

As you can see here, we can pass a generator expression to <code>spark.createDataFrame</code>, and this solution does not require us to know the length of the tuples ahead of time.
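The dictionary reshaping described above can also be sketched in plain Python 3, this time keeping the values under the <code>a</code> key so that each row carries its label. This is only an illustrative sketch: <code>build_rows</code> is a hypothetical helper that does not appear in the original answer, and it assumes every key under <code>"d"</code> also exists under <code>"a"</code>.

```python
# Same sample dictionary as in the answer above.
df_dict = {
    "a": {"1": "stuff", "2": "stuff2"},
    "d": {"1": [(1, 2), (3, 4)], "2": [(1, 2), (3, 4)]},
}


def build_rows(d):
    """Yield (index, a, value1, value2, ...) rows, visiting keys in sorted order.

    Hypothetical helper for illustration; assumes d["a"] and d["d"] share keys.
    """
    index = 0
    for key, tuples in sorted(d["d"].items(), key=lambda kv: kv[0]):
        for tup in tuples:
            # Prepend the running index and the label stored under the same key.
            yield (index, d["a"][key]) + tuple(tup)
            index += 1


rows = list(build_rows(df_dict))
columns = ["index", "a", "value1", "value2"]

# Assumption: with an active SparkSession named `spark`, these rows could be
# handed on as spark.createDataFrame(rows, columns).
for row in rows:
    print(row)
```

Building the rows in plain Python first makes the reshaping easy to inspect before any Spark session is involved. The column list here is fixed at four names, so it would need to change if the tuples had a different length.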
