My DataFrame looks like this:
10311 105903003 373873005 385055001 392521001 ... 26 27 28 29 30
0 21.0 5.0 5.0 21.0 8.0 ... 0 0 0 0 1
1 0.0 3.0 3.0 0.0 6.0 ... 0 0 0 0 1
2 32.0 8.0 8.0 32.0 4.0 ... 0 0 0 0 1
3 15.0 7.0 7.0 15.0 5.0 ... 0 0 0 0 1
4 0.0 4.0 4.0 0.0 4.0 ... 0 0 0 0 1
... ... ... ... ... ... .. .. .. .. ..
52699 0.0 2.0 2.0 0.0 6.0 ... 0 0 0 0 1
52700 0.0 2.0 2.0 0.0 6.0 ... 0 0 0 0 1
52701 22.0 4.0 4.0 22.0 9.0 ... 0 0 0 0 1
52702 0.0 4.0 4.0 0.0 8.0 ... 0 0 0 0 1
52703 0.0 2.0 2.0 0.0 2.0 ... 0 0 0 0 1
[52704 rows x 43 columns]
And here is my code:
from pyspark import SparkContext  # needed for SparkContext.getOrCreate()
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
spark_dff = sqlContext.createDataFrame(dff.astype(float))
spark_dff.head(4)
which returns this:
Out[99]:
[Row(10311=21.0, 105903003=5.0, 373873005=5.0, 385055001=21.0, 392521001=8.0, 410942007=5.0, 423367003=12.0, 46992007=21.0, 4850=27.0, 87612001=43.0, filename=1.0, filename_int=1.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
Row(10311=0.0, 105903003=3.0, 373873005=3.0, 385055001=0.0, 392521001=6.0, 410942007=3.0, 423367003=0.0, 46992007=0.0, 4850=6.0, 87612001=3.0, filename=10.0, filename_int=10.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
Row(10311=32.0, 105903003=8.0, 373873005=8.0, 385055001=32.0, 392521001=4.0, 410942007=8.0, 423367003=15.0, 46992007=32.0, 4850=9.0, 87612001=9.0, filename=100.0, filename_int=100.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0),
Row(10311=15.0, 105903003=7.0, 373873005=7.0, 385055001=15.0, 392521001=5.0, 410942007=7.0, 423367003=7.0, 46992007=15.0, 4850=12.0, 87612001=21.0, filename=10000.0, filename_int=10000.0, 0=0.0, 1=0.0, 2=0.0, 3=0.0, 4=0.0, 5=0.0, 6=0.0, 7=0.0, 8=0.0, 9=0.0, 10=0.0, 11=0.0, 12=0.0, 13=0.0, 14=0.0, 15=0.0, 16=0.0, 17=0.0, 18=0.0, 19=0.0, 20=0.0, 21=0.0, 22=0.0, 23=0.0, 24=0.0, 25=0.0, 26=0.0, 27=0.0, 28=0.0, 29=0.0, 30=1.0)]
and this:

spark_dff
Out[100]: DataFrame[10311: double, 105903003: double, 373873005: double, 385055001: double, 392521001: double, 410942007: double, 423367003: double, 46992007: double, 4850: double, 87612001: double, filename: double, filename_int: double, 0: double, 1: double, 2: double, 3: double, 4: double, 5: double, 6: double, 7: double, 8: double, 9: double, 10: double, 11: double, 12: double, 13: double, 14: double, 15: double, 16: double, 17: double, 18: double, 19: double, 20: double, 21: double, 22: double, 23: double, 24: double, 25: double, 26: double, 27: double, 28: double, 29: double, 30: double]
Now here is my problem:
lz = ['10311','105903003','373873005','385055001'] #<------ (1)
from pyspark.ml.feature import VectorAssembler
vectorAssemblerZ = VectorAssembler(inputCols = lz , outputCol = 'zz')
vhouse_df = vectorAssemblerZ.transform(spark_dff)
vhouse_df = vhouse_df.select(['zz'])
vhouse_df.show(3)
This returns:
+-------------------+
| zz|
+-------------------+
|[21.0,5.0,5.0,21.0]|
| [0.0,3.0,3.0,0.0]|
|[32.0,8.0,8.0,32.0]|
+-------------------+
This looks good: arrays containing 4 values each.
But when I change line (1) to:
lz = ['1','2','3','4']
the result has a different structure:
+-----------------+
| zz|
+-----------------+
| (4,[],[])|
| (4,[2],[1.0])|
| (4,[3],[1.0])|
+-----------------+
It is a number, then an array, then another array, and I don't understand why it has this structure.
I made another change:
lz = ['10311','105903003','3','4']
and the result is even stranger:
+------------------+
| zz|
+------------------+
|[21.0,5.0,0.0,0.0]|
| (4,[1],[3.0])|
|[32.0,8.0,0.0,0.0]|
+------------------+
Why does the structure change when I change the columns, and how can I fix it?
These are two different vector representations, called dense and sparse. For example, the dense vector

[21.0, 5.0, 0.0, 0.0]

is the same as the sparse vector

(4, [0, 1], [21.0, 5.0])

where 4 is the size of the vector, [0, 1] are the indices of the non-zero elements, and [21.0, 5.0] are the corresponding values; all other positions are assumed to be zero.

Which representation a vector gets depends on how many zeros it contains: a vector with mostly zeros is stored as a sparse vector, while a vector with mostly non-zero elements is stored as a dense vector.
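To make the mapping concrete, here is a minimal pure-Python sketch (not Spark's API) that expands a (size, indices, values) sparse triple into its dense equivalent, using the rows from your output:

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size          # every position defaults to zero
    for i, v in zip(indices, values):
        dense[i] = v              # fill in the stored non-zero entries
    return dense

print(to_dense(4, [2], [1.0]))           # the row shown as (4,[2],[1.0])
print(to_dense(4, [0, 1], [21.0, 5.0]))  # -> [21.0, 5.0, 0.0, 0.0]
```

Both forms carry exactly the same information; the sparse one just skips storing the zeros.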
There is nothing that needs fixing here.
See the Spark documentation on local vectors for more details.
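To illustrate how a representation might be chosen, here is a pure-Python sketch of a compressor that picks sparse when the vector is mostly zeros. The 50% cutoff below is an assumption chosen for illustration, not Spark's exact internal heuristic:

```python
def compress(dense):
    """Pick a representation by zero count (50% cutoff is illustrative only)."""
    nnz = sum(1 for v in dense if v != 0.0)
    if nnz < len(dense) / 2:      # mostly zeros -> sparse triple
        indices = [i for i, v in enumerate(dense) if v != 0.0]
        values = [dense[i] for i in indices]
        return ("sparse", len(dense), indices, values)
    return ("dense", dense)       # mostly non-zero -> keep dense

print(compress([0.0, 0.0, 1.0, 0.0]))    # -> ('sparse', 4, [2], [1.0])
print(compress([21.0, 5.0, 5.0, 21.0]))  # -> ('dense', [21.0, 5.0, 5.0, 21.0])
```

This is why your one-hot columns ('1', '2', '3', '4'), which are mostly zeros, came out sparse, while the count columns came out dense.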