How do I change the order of columns in a PySpark DataFrame?

Posted 2024-09-28 05:42:06


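I have a PySpark DataFrame that contains supervised (labeled) data. The label attribute can appear at any position in the DataFrame, and I want to move it to the last column. For example, suppose the attributes in my DataFrame are ['age', 'gender', 'defaulter', 'salary', 'occupation'] and so on, where 'defaulter' is the label attribute. I want to move this attribute to the end, so that the column order becomes ['age', 'gender', 'salary', 'occupation', 'defaulter']. I am doing this because, to apply an ML algorithm such as logistic regression to this data, I have to convert it to an RDD and extract the last value (or the first value) as the label of a LabeledPoint (https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/logistic_regression.py).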


1 Answer

User

#1 · Posted 2024-09-28 05:42:06

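If you are running ML algorithms on the DataFrame, consider using VectorAssembler to build the features column, so the position of the label column no longer matters. Something like this: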

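from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

# Gather the feature columns into one vector column; the label stays separate.
assembler = VectorAssembler(
    inputCols=['age', 'gender', 'salary', 'occupation'],
    outputCol="features")

# On Spark 2.0+ a DataFrame has no .map, so go through .rdd and convert the vector.
input_rdd = assembler.transform(dataframe) \
    .rdd.map(lambda row: LabeledPoint(row.defaulter, Vectors.fromML(row.features)))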
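
If you do still want the label physically moved to the last column, as the question asks, one possible approach is to rebuild the column list and pass it to select. The sketch below is only an illustration: move_label_last and with_label_last are hypothetical helper names, not part of PySpark, and the only DataFrame members assumed are df.columns and df.select.

```python
def move_label_last(columns, label):
    """Return the column names with `label` moved to the end; the rest keep their order."""
    if label not in columns:
        raise ValueError(f"column {label!r} not found")
    return [c for c in columns if c != label] + [label]


def with_label_last(df, label):
    """Reorder a DataFrame so that `label` is its last column.

    Hypothetical helper: relies only on df.columns and df.select(*names).
    """
    return df.select(*move_label_last(df.columns, label))


cols = ['age', 'gender', 'defaulter', 'salary', 'occupation']
print(move_label_last(cols, 'defaulter'))
# ['age', 'gender', 'salary', 'occupation', 'defaulter']
```

With a real DataFrame this would be called as with_label_last(dataframe, 'defaulter'). select returns a new DataFrame and leaves the original untouched.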
