How to use the split values as bucket labels in pyspark Bucketizer

from pyspark.ml.feature import Bucketizer

data = [(0, -1.0), (1, 0.0), (2, 0.5), (3, 1.0), (4, 10.0), (5, 25.0), (6, 100.0), (7, 300.0), (8, float("nan"))]
df = spark.createDataFrame(data, ["id", "value"])
splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="result").setHandleInvalid("keep").transform(df)
result_bucketizer.show()

Desired output (split values as labels instead of bucket indices):

+---+-----+------+
| id|value|result|
+---+-----+------+
|  0| -1.0|  -inf|
|  1|  0.0|   0.0|
|  2|  0.5| 0.001|
|  3|  1.0|   1.0|
|  4| 10.0|  10.0|
|  5| 25.0|  20.0|
|  6|100.0| 100.0|
|  7|300.0| 100.0|
|  8|  NaN|   NaN|
+---+-----+------+

1 Answer

User

#1 · Posted on 2024-09-29 17:12:33

This is how I did it.

First, I created the DataFrame.

from pyspark.ml.feature import Bucketizer

# assumes an active SparkSession named `spark`
data = [(0, -1.0), (1, 0.0), (2, 0.5), (3, 1.0), (4, 10.0), (5, 25.0), (6, 100.0), (7, 300.0), (8, float("nan"))]
df = spark.createDataFrame(data, ["id", "value"])
splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]
# map each bucket index to the split value at that index (the bucket's lower bound)
splits_dict = {i: splits[i] for i in range(len(splits))}
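The dictionary works as a label lookup because Bucketizer numbers its buckets in order: bucket i covers the range from splits[i] up to (but not including) splits[i + 1], and the last bucket also includes its upper bound. The following is a minimal plain-Python sketch of that indexing rule, with no Spark involved. `bucket_index` is a hypothetical helper written only for illustration; it is not part of pyspark, and it assumes the value is not NaN and lies within the splits.

```python
from bisect import bisect_right

splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]


def bucket_index(value, splits):
    """Hypothetical stand-in for the index Bucketizer assigns to a non-NaN value."""
    # Bucket i covers [splits[i], splits[i + 1]); the last bucket also
    # includes its upper bound, hence the clamp to len(splits) - 2.
    last_bucket = len(splits) - 2
    return min(bisect_right(splits, value) - 1, last_bucket)


# Same lookup as in the answer: bucket index -> lower bound of that bucket.
splits_dict = dict(enumerate(splits))

for value in [-1.0, 0.0, 0.5, 1.0, 10.0, 25.0, 100.0, 300.0]:
    idx = bucket_index(value, splits)
    print(value, idx, splits_dict[idx])
```

For example, 25.0 falls in bucket 6, whose lower bound is 20, which is the label the question asks for.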

Then I created the bucketizer as a separate variable.

^{pr2}$
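The code for this step did not survive in the post; only the placeholder above remains. Below is a sketch of what it might have looked like, not the author's actual code. The helper name `bucketize` is hypothetical, and the `"skip"` setting is a guess based on the NaN row (id 8) being absent from the answer's output; the question itself used `"keep"`.

```python
def bucketize(df, splits, handle_invalid="skip"):
    """Hypothetical helper: add a 'result' column holding each row's bucket index."""
    # Imported inside the function so the helper can be defined without Spark installed.
    from pyspark.ml.feature import Bucketizer

    # Keep the bucketizer in its own variable, as the answer describes.
    bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="result")
    # "skip" (an assumption) drops NaN rows; "keep" and "error" are the other options.
    bucketizer = bucketizer.setHandleInvalid(handle_invalid)
    return bucketizer.transform(df)
```

With the DataFrame and splits defined earlier, `bucketed = bucketize(df, splits)` would give the `bucketed` variable used in the next step.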

To get the labels, I applied the replace function using the dict defined earlier.

bucketed = bucketed.replace(to_replace=splits_dict, subset=['result'])
bucketed.show()

Output:

+---+-----+---------+
| id|value|   result|
+---+-----+---------+
|  0| -1.0|-Infinity|
|  1|  0.0|      0.0|
|  2|  0.5|    0.001|
|  3|  1.0|      1.0|
|  4| 10.0|     10.0|
|  5| 25.0|     20.0|
|  6|100.0|    100.0|
|  7|300.0|    100.0|
+---+-----+---------+
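Note that the NaN row (id 8) from the question's desired output is missing here. If the bucketizer is run with handleInvalid set to "keep", as in the question, NaN values go into one extra bucket whose index is len(splits) - 1 (15 for these splits), and splits_dict would relabel that index as Infinity rather than NaN. One possible workaround is to build the lookup so that the extra bucket maps to NaN. This is a sketch only: `build_label_map` is a hypothetical helper, and passing the resulting dict to `replace` has not been tested against Spark here.

```python
import math

splits = [-float("inf"), 0, 0.001, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]


def build_label_map(splits, keep_invalid=False):
    """Hypothetical helper: map each bucket index to the label to show for it."""
    # There are len(splits) - 1 regular buckets; bucket i is labelled with
    # its lower bound splits[i].
    labels = {i: float(s) for i, s in enumerate(splits[:-1])}
    if keep_invalid:
        # With handleInvalid="keep", NaN rows land in one extra bucket
        # whose index is len(splits) - 1.
        labels[len(splits) - 1] = float("nan")
    return labels


label_map = build_label_map(splits, keep_invalid=True)
print(label_map[6], math.isnan(label_map[15]))
```

The map could then be used in place of splits_dict, for example `bucketed.replace(to_replace=label_map, subset=['result'])`.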
