Reshaping a DataFrame and changing column positions

Published 2024-09-26 22:53:53


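I have a DataFrame like the one below and I want to reshape it into the form shown further down. I have been working on this since this morning and cannot solve it. Can anyone help?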

2019.    C1.   2018.   C2    2017     C3
FF.       20.   TT.    70.   HH.      88
DD.       22.   JJ.    66.   DD.      99

Reshape it into:

COL1.   C

FF.     20.    2019
DD.     22.    2019
TT.     70.    2018
JJ.     66.    2108
HH.     88.    2017
DD.     99.    2017

2 answers

Use Spark's stack function to split the data into rows: https://spark.apache.org/docs/latest/api/sql/index.html#stack

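// spark-shell imports spark.implicits._ automatically; a standalone app needs it for toDF.
import spark.implicits._

val df = Seq(
  ("FF", 20, "TT", 70, "HH", 88),
  ("DD", 22, "JJ", 66, "DD", 99)
).toDF("2019","C1","2018","C2","2017","C3")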

df.createOrReplaceTempView("df")
df.show

// Add each year as a literal column next to its label/value pair.
val df1 = spark.sql("SELECT 2019 as year2019, `2019`, `C1`, 2018 as year2018, `2018`, `C2`, 2017 as year2017, `2017`, `C3` from df")

df1.createOrReplaceTempView("df1")
df1.show

// stack(3, ...) splits the 9 expressions into 3 rows of 3 columns each.
spark.sql("SELECT stack(3, `2019`, `C1`, year2019, `2018`, `C2`, year2018, `2017`, `C3`, year2017) as (`COL1`, `C`, `Year`) from df1").show

// Exiting paste mode, now interpreting.

+----+---+----+---+----+---+
|2019| C1|2018| C2|2017| C3|
+----+---+----+---+----+---+
|  FF| 20|  TT| 70|  HH| 88|
|  DD| 22|  JJ| 66|  DD| 99|
+----+---+----+---+----+---+

+--------+----+---+--------+----+---+--------+----+---+
|year2019|2019| C1|year2018|2018| C2|year2017|2017| C3|
+--------+----+---+--------+----+---+--------+----+---+
|    2019|  FF| 20|    2018|  TT| 70|    2017|  HH| 88|
|    2019|  DD| 22|    2018|  JJ| 66|    2017|  DD| 99|
+--------+----+---+--------+----+---+--------+----+---+

+----+---+----+
|COL1|  C|Year|
+----+---+----+
|  FF| 20|2019|
|  TT| 70|2018|
|  HH| 88|2017|
|  DD| 22|2019|
|  JJ| 66|2018|
|  DD| 99|2017|
+----+---+----+
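
To make the row order in the output above easier to follow, here is a minimal pure-Python sketch of what stack(n, ...) does to a single row: it cuts the flat list of expressions into n consecutive rows and pads the last one with nulls if the values do not divide evenly. The helper name stack_row is hypothetical and only imitates the behaviour; it is not part of Spark.

```python
from typing import Any, List, Sequence, Tuple


def stack_row(n: int, values: Sequence[Any]) -> List[Tuple[Any, ...]]:
    """Hypothetical helper imitating Spark SQL's stack(n, v1, v2, ...) for one row."""
    width = -(-len(values) // n)  # ceiling division: columns per output row
    # Pad with None so the last output row is complete, like Spark's NULL padding.
    padded = list(values) + [None] * (width * n - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]


# Same column order as the stack(...) call: label, value, year for each pair.
rows = [
    ("FF", 20, 2019, "TT", 70, 2018, "HH", 88, 2017),
    ("DD", 22, 2019, "JJ", 66, 2018, "DD", 99, 2017),
]

stacked = [out for row in rows for out in stack_row(3, row)]
for record in stacked:
    print(record)
```

Because each input row is expanded on its own, the three records for the first row come before the three for the second, which is why the Spark result is not grouped by year.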

I think there is a small typo in your desired output: you have 2108 instead of 2018.

I tried to replicate your DataFrame exactly:

>>> df.to_dict()
Out[99]: 
{'2019.': {0: 'FF.', 1: 'DD.'},
 'C1.': {0: 20.0, 1: 22.0},
 '2018.': {0: 'TT.', 1: 'JJ.'},
 'C2': {0: 70.0, 1: 66.0},
 2017: {0: 'HH.', 1: 'DD.'},
 'C3': {0: 88, 1: 99}}

Split the original df into two parts, then use pd.melt(), rename(axis=1), and drop():

# One
one = df[['2019.','2018.',2017]].melt().rename({'variable':'','value':'COL1.'},axis=1)

print(one)
          COL1.
0  2019.0   FF.
1  2019.0   DD.
2  2018.0   TT.
3  2018.0   JJ.
4    2017   HH.
5    2017   DD.

# Two
two = df[['C1.','C2','C3']].melt().drop('variable',axis=1).rename({'value':'C'},axis=1)

print(two)
      C
0  20.0
1  22.0
2  70.0
3  66.0
4  88.0
5  99.0

Finally, I use pd.concat and reindex() to get the desired column order:

out = pd.concat([one,two],axis=1).reindex(['COL1.','C',''], axis=1)

  COL1.     C        
0   FF.  20.0  2019.0
1   DD.  22.0  2019.0
2   TT.  70.0  2018.0
3   JJ.  66.0  2018.0
4   HH.  88.0    2017
5   DD.  99.0    2017
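
For comparison, here is a self-contained sketch of another way to get the same shape in pandas: take each (label column, value column) pair, rename it to common names, tag it with its year, and concatenate the pieces. It assumes the trailing dots in the question's column names and values are typos, so the sample data is rebuilt without them; the variable names pairs and parts and the column name Year are my own choices.

```python
import pandas as pd

# Assumption: sample data from the question, with the stray trailing dots removed.
df = pd.DataFrame({
    "2019": ["FF", "DD"], "C1": [20, 22],
    "2018": ["TT", "JJ"], "C2": [70, 66],
    "2017": ["HH", "DD"], "C3": [88, 99],
})

# Each year column is paired with the value column that follows it.
pairs = [("2019", "C1"), ("2018", "C2"), ("2017", "C3")]

parts = [
    df[[label_col, value_col]]
    .rename(columns={label_col: "COL1", value_col: "C"})
    .assign(Year=int(label_col))  # the year comes from the column name itself
    for label_col, value_col in pairs
]

# Stack the three two-row pieces on top of each other with a fresh index.
out = pd.concat(parts, ignore_index=True)
print(out)
```

This keeps the year as an integer column and avoids the empty column name used above, at the cost of listing the column pairs by hand.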
