<p>You can create a class for this. My company is also on Spark 2.4, so I had the same problem and tried writing an F1-score evaluator for binary classification. The new class only needs to provide the <code>.evaluate</code> and <code>.isLargerBetter</code> methods. Here is the sample code I tried against <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html" rel="nofollow noreferrer">this</a> dataset:</p>
<pre><code>class F1BinaryEvaluator():
    def __init__(self, predCol="prediction", labelCol="label", metricLabel=1.0):
        self.labelCol = labelCol
        self.predCol = predCol
        self.metricLabel = metricLabel

    def isLargerBetter(self):
        # a higher F1 score means a better model
        return True

    def evaluate(self, dataframe):
        # confusion-matrix counts via SQL filter expressions
        tp = dataframe.filter(self.labelCol + ' = ' + str(self.metricLabel) + ' and ' + self.predCol + ' = ' + str(self.metricLabel)).count()
        fp = dataframe.filter(self.labelCol + ' != ' + str(self.metricLabel) + ' and ' + self.predCol + ' = ' + str(self.metricLabel)).count()
        fn = dataframe.filter(self.labelCol + ' = ' + str(self.metricLabel) + ' and ' + self.predCol + ' != ' + str(self.metricLabel)).count()
        # F1 = tp / (tp + 0.5 * (fn + fp))
        return tp / (tp + 0.5 * (fn + fp))
f1_evaluator = F1BinaryEvaluator()

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxDepth, [3, 5, 7])
             .addGrid(gbt.maxBins, [10, 30])
             .addGrid(gbt.maxIter, [10, 15])
             .build())
cv = CrossValidator(estimator=gbt, estimatorParamMaps=paramGrid, evaluator=f1_evaluator, numFolds=5)
cvModel = cv.fit(train)
cv_pred = cvModel.bestModel.transform(test)
</code></pre>
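<p>As a sanity check on the formula used in <code>evaluate</code>, the same F1 computation can be reproduced in plain Python. The <code>f1_binary</code> helper and the sample label/prediction lists below are made up for illustration; they are not part of the Spark code above:</p>

```python
def f1_binary(labels, preds, positive=1.0):
    # confusion-matrix counts for the positive class
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    # same formula as the evaluator: F1 = tp / (tp + 0.5 * (fn + fp))
    return tp / (tp + 0.5 * (fn + fp))

labels = [1, 1, 0, 1, 0, 0]
preds  = [1, 0, 0, 1, 1, 0]
# tp=2, fp=1, fn=1  ->  2 / (2 + 0.5 * 2) = 2/3
print(f1_binary(labels, preds))
```

<p>This is equivalent to the usual harmonic mean of precision and recall, just written with raw counts.</p>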
<p>The CV process ran without issue, although I cannot vouch for its performance. I also compared the evaluator against <code>sklearn.metrics.f1_score</code>, and the results were close:</p>
<pre><code>from sklearn.metrics import f1_score

print("made-up F1 Score evaluator : ", f1_evaluator.evaluate(cv_pred))
print("sklearn F1 Score evaluator : ", f1_score(cv_pred.select('label').toPandas(), cv_pred.select('prediction').toPandas()))

# made-up F1 Score evaluator :  0.9363636363636364
# sklearn F1 Score evaluator :  0.9363636363636363
</code></pre>