Is there a way to save a PySpark DataFrame in ARFF format?

Posted 2024-09-24 06:34:52


I have a DataFrame ready and have transformed it with VectorAssembler so that it can be used with the ML library:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier

target_index = StringIndexer(inputCol="target", outputCol="target_idx").fit(df)
assembler = VectorAssembler(
    inputCols=[
        x for x in df.columns if x not in ['target', 'ident_1', 'id_l', 'target_idx']
    ],
    outputCol='features'
)

cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features')
pipe = Pipeline(stages=[target_index, assembler, cl])
model = pipe.fit(df_train)
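# model.stages[1] is the VectorAssembler stage itself, not a DataFrame;
# transform() returns the data with the 'features' column added.
df_transformed = model.transform(df_train)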

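Now I want to write the transformed dataset to an ARFF file. Is there a way to write a PySpark DataFrame that has already been transformed by VectorAssembler to ARFF format?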
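For illustration, here is a minimal sketch of one possible approach, not a confirmed Spark feature: as far as I know, Spark's DataFrameWriter has no ARFF output format, so the file would have to be assembled by hand. The helper `arff_lines` below is hypothetical (it is not part of PySpark). It assumes dense rows, simple attribute names without spaces or quotes, all features being numeric, and a nominal class label in the last position.

```python
from typing import Iterable, Iterator, Sequence


def arff_lines(relation: str,
               feature_names: Sequence[str],
               class_values: Sequence[str],
               rows: Iterable[Sequence]) -> Iterator[str]:
    """Yield the lines of a dense ARFF file (hypothetical helper).

    Each row is (feature values..., class label). Features are declared
    NUMERIC and the label is declared as a nominal attribute.
    """
    yield f"@RELATION {relation}"
    yield ""
    for name in feature_names:
        yield f"@ATTRIBUTE {name} NUMERIC"
    yield "@ATTRIBUTE class {" + ",".join(class_values) + "}"
    yield ""
    yield "@DATA"
    for row in rows:
        # ARFF marks a missing value with a question mark.
        yield ",".join("?" if v is None else str(v) for v in row)


# Small stand-in data so the sketch runs without a Spark session.
example = list(arff_lines(
    "demo",
    ["f1", "f2"],
    ["no", "yes"],
    [(1.0, 2.5, "yes"), (None, 0.0, "no")],
))

# Assumed PySpark wiring (not executed here; requires Spark 3.0+ for
# vector_to_array, and `transformed` is a DataFrame with a 'features' column,
# for example the output of model.transform(df_train)):
#   from pyspark.ml.functions import vector_to_array
#   flat = transformed.select(vector_to_array("features").alias("f"), "target")
#   rows = (list(r["f"]) + [r["target"]] for r in flat.toLocalIterator())
#   names = assembler.getInputCols()
#   labels = target_index.labels
#   with open("out.arff", "w") as fh:
#       fh.write("\n".join(arff_lines("data", names, labels, rows)) + "\n")
```

Because `toLocalIterator()` streams rows to the driver, this only suits data that can reasonably be written from a single machine; for larger data, writing CSV with Spark and prepending the ARFF header afterwards may be the more practical route.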

