Why does the result structure change when I change the column names in PySpark?

2024-07-03 06:59:45 发布


My DataFrame looks like this:

       10311  105903003  373873005  385055001  392521001  ...  26  27  28  29  30
0       21.0        5.0        5.0       21.0        8.0  ...   0   0   0   0   1
1        0.0        3.0        3.0        0.0        6.0  ...   0   0   0   0   1
2       32.0        8.0        8.0       32.0        4.0  ...   0   0   0   0   1
3       15.0        7.0        7.0       15.0        5.0  ...   0   0   0   0   1
4        0.0        4.0        4.0        0.0        4.0  ...   0   0   0   0   1
     ...        ...        ...        ...        ...  ...  ..  ..  ..  ..  ..
52699    0.0        2.0        2.0        0.0        6.0  ...   0   0   0   0   1
52700    0.0        2.0        2.0        0.0        6.0  ...   0   0   0   0   1
52701   22.0        4.0        4.0       22.0        9.0  ...   0   0   0   0   1
52702    0.0        4.0        4.0        0.0        8.0  ...   0   0   0   0   1
52703    0.0        2.0        2.0        0.0        2.0  ...   0   0   0   0   1
[52704 rows x 43 columns]

Here is my code:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

spark_dff = sqlContext.createDataFrame(dff.astype(float))
spark_dff.head(4)

which returns this:

Out[99]: 
[Row(10311=21.0, 105903003=5.0, 373873005=5.0, 385055001=21.0, 392521001=8.0, 410942007=5.0, 423367003=12.0, 46992007=21.0, 4850=27.0, 87612001=43.0, filename=1.0, filename_int=1.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
 Row(10311=0.0, 105903003=3.0, 373873005=3.0, 385055001=0.0, 392521001=6.0, 410942007=3.0, 423367003=0.0, 46992007=0.0, 4850=6.0, 87612001=3.0, filename=10.0, filename_int=10.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
 Row(10311=32.0, 105903003=8.0, 373873005=8.0, 385055001=32.0, 392521001=4.0, 410942007=8.0, 423367003=15.0, 46992007=32.0, 4850=9.0, 87612001=9.0, filename=100.0, filename_int=100.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
 Row(10311=15.0, 105903003=7.0, 373873005=7.0, 385055001=15.0, 392521001=5.0, 410942007=7.0, 423367003=7.0, 46992007=15.0, 4850=12.0, 87612001=21.0, filename=10000.0, filename_int=10000.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0)]

spark_dff

and this:

spark_dff
Out[100]: DataFrame[10311: double, 105903003: double, 373873005: double, 385055001: double, 392521001: double, 410942007: double, 423367003: double, 46992007: double, 4850: double, 87612001: double, filename: double, filename_int: double, 0: double, 1: double, 2: double, 3: double, 4: double, 5: double, 6: double, 7: double, 8: double, 9: double, 10: double, 11: double, 12: double, 13: double, 14: double, 15: double, 16: double, 17: double, 18: double, 19: double, 20: double, 21: double, 22: double, 23: double, 24: double, 25: double, 26: double, 27: double, 28: double, 29: double, 30: double]

Now here is my problem:

lz = ['10311','105903003','373873005','385055001']      #<------ (1)

from pyspark.ml.feature import VectorAssembler
vectorAssemblerZ = VectorAssembler(inputCols = lz , outputCol = 'zz')
vhouse_df = vectorAssemblerZ.transform(spark_dff)
vhouse_df = vhouse_df.select(['zz'])
vhouse_df.show(3)

This returns:

+-------------------+
|                 zz|
+-------------------+
|[21.0,5.0,5.0,21.0]|
|  [0.0,3.0,3.0,0.0]|
|[32.0,8.0,8.0,32.0]|
+-------------------+

This looks fine: an array containing 4 values.

But when I change line (1) to

lz = ['1','2','3','4']

the result has a different structure:

+-----------------+
|               zz|
+-----------------+
|        (4,[],[])|
|    (4,[2],[1.0])|
|    (4,[3],[1.0])|
+-----------------+

It is a number, then an array, then another array.

I don't know why it has this structure.

I made another change:

lz = ['10311','105903003','3','4']

and the result is even stranger:

+------------------+
|                zz|
+------------------+
|[21.0,5.0,0.0,0.0]|
|     (4,[1],[3.0])|
|[32.0,8.0,0.0,0.0]|
+------------------+

Why does the structure change when I change the columns, and how can I fix it?


Tags: from, import, df, array, filename, structure, spark, pyspark
1 Answer
User
#1 · Posted on 2024-07-03 06:59:45

These are two different vector representations, called dense and sparse. For example, the dense vector [21.0, 5.0, 0.0, 0.0] is equivalent to the sparse vector (4, [0, 1], [21.0, 5.0]), where 4 is the size of the vector, [0, 1] are the indices of the non-zero elements, and [21.0, 5.0] are the corresponding values. All other values are assumed to be zero.
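As a quick illustration of that equivalence (a minimal sketch using the pyspark.ml.linalg vector classes, not your actual data), you can build both representations by hand and compare them:

from pyspark.ml.linalg import Vectors

# Dense representation: every value is stored explicitly.
dense = Vectors.dense([21.0, 5.0, 0.0, 0.0])

# Sparse representation: size 4, non-zero values at indices 0 and 1.
sparse = Vectors.sparse(4, [0, 1], [21.0, 5.0])

print(dense)             # [21.0,5.0,0.0,0.0]
print(sparse)            # (4,[0,1],[21.0,5.0])
print(dense == sparse)   # True: both describe the same vector
print(sparse.toArray())  # array([21.,  5.,  0.,  0.])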

Which representation a vector uses depends on how many zeros it contains. If a vector has more zeros than non-zero elements, it is stored as a sparse vector; if it has more non-zero elements, it is stored as a dense vector.

There is nothing here that needs to be fixed.
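That said, if you would rather see every row as a plain array of 4 values regardless of sparsity, one option (assuming Spark 3.0+, where pyspark.ml.functions.vector_to_array is available) is to convert the vector column into an ordinary array column:

from pyspark.ml.functions import vector_to_array

# Convert the ML vector column 'zz' into an array<double> column,
# so dense and sparse vectors both display as [v0, v1, v2, v3].
vhouse_arr = vhouse_df.withColumn('zz_arr', vector_to_array('zz'))
vhouse_arr.select('zz_arr').show(3, truncate=False)

Note this only changes how the values are displayed and stored in the DataFrame; the VectorAssembler output itself is still fine to feed into downstream ML stages as-is.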

See this Spark documentation for further explanation.
