MultiLabelBinarizer with duplicate values

Posted 2024-09-28 01:58:58


I have an expected array [1,1,3] and a predicted array [1,2,2,4], and I want to compute precision_recall_fscore_support for them, so I need a matrix in the following format:

>> mlb = MultiLabelBinarizerWithDuplicates()
>> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
array([[1,1,0,0,1,0],
       [1,0,1,1,0,1]])
>> mlb.classes_
[1,1,2,2,3,4]

For duplicated values, I don't care which of the duplicate columns is switched on, so this is also a valid result:

array([[1,1,0,0,1,0],
       [0,1,1,1,0,1]])

MultiLabelBinarizer explicitly states that "All entries should be unique (cannot contain duplicate classes)", so it does not support this use case.


1 answer
User
#1 · Posted 2024-09-28 01:58:58

A working first-pass implementation:

from collections import Counter, defaultdict

import numpy as np


class MultiLabelBinarizerWithDuplicates:
    """
    Similar to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
    but with support for duplicated values: a label that occurs up to k times
    in a single sample gets k columns.
    """

    def __init__(self, mapping=None):
        # Note: any mapping passed here is overwritten by fit().
        self.mapping = mapping

    def fit(self, y):
        # For every label, record the largest number of times it occurs in one sample.
        max_count = Counter()
        for labels in y:
            # Counter | Counter keeps the per-label maximum of the two counts.
            max_count |= Counter(labels)

        # One column per occurrence, e.g. {1: 2, 3: 1} -> [1, 1, 3].
        self.classes_ = sorted(max_count.elements())
        # Map each label to the column indices reserved for it, in ascending order.
        self.mapping = defaultdict(list)
        for idx, class_ in enumerate(self.classes_):
            self.mapping[class_].append(idx)

        return self

    def transform(self, y):
        result_matrix = []
        for labels in y:
            data = [0] * len(self.classes_)
            seen = Counter()  # occurrences of each label so far in this sample
            for label in labels:
                slots = self.mapping.get(label, [])
                # Fill the label's columns left to right. Labels not seen during
                # fit, or occurrences beyond the fitted maximum, are ignored.
                if seen[label] < len(slots):
                    data[slots[seen[label]]] = 1
                    seen[label] += 1
            result_matrix.append(data)
        return np.array(result_matrix)

    def fit_transform(self, y):
        return self.fit(y).transform(y)

Usage:

>> mlb = MultiLabelBinarizerWithDuplicates()
>> transformed = mlb.fit_transform([(1, 1, 3), (1, 2, 2, 4)])
array([[1,1,0,0,1,0],
       [1,0,1,1,0,1]])
>> mlb.classes_
[1,1,2,2,3,4]
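
As a cross-check on the numbers you should get once these rows are passed to a metric, here is a minimal sketch that computes per-sample precision, recall and F1 directly from the label lists, treating them as multisets. It does not use scikit-learn, and the helper name multiset_precision_recall_f1 is my own, not part of any library. It assumes the prediction has no more copies of a label than the binarizer was fitted on, which always holds when both rows go through fit_transform together.

```python
from collections import Counter


def multiset_precision_recall_f1(expected, predicted):
    """Precision, recall and F1 where every repeated label counts separately."""
    expected_counts = Counter(expected)
    predicted_counts = Counter(predicted)
    # Counter & Counter keeps the minimum count per label, i.e. the multiset
    # intersection: how many predicted occurrences are matched by expected ones.
    true_positives = sum((expected_counts & predicted_counts).values())
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = multiset_precision_recall_f1([1, 1, 3], [1, 2, 2, 4])
print(precision, recall, f1)
```

For the example in the question, only one of the predicted labels (a single 1) is matched, so precision is 1/4 and recall is 1/3. Comparing the two binarized rows column by column gives the same single true positive.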
