What are the correct arguments for partial_fit() in sklearn.multiclass.OneVsRestClassifier?

Posted 2024-09-28 01:24:30


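I am working on a multilabel classification problem, so I chose OneVsRestClassifier from sklearn.multiclass.
I convert the multilabel targets from a list of lists to an array with MultiLabelBinarizer. (There are 5975 labels in total.)
Because the input data is too large, I call partial_fit() on each batch of data, but when I run the code it raises an error:
"ValueError: The object was not fitted with multilabel input."
It seems that y_train is the problem, yet when I use fit() instead, the same y_train works fine.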
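
The wording of that error matches the one sklearn.preprocessing.LabelBinarizer raises when it has been fitted on a 1-D array of labels and is then asked to transform a multilabel indicator matrix. The sketch below reproduces only that behaviour in isolation. That OneVsRestClassifier.partial_fit reaches the error through this same path is an assumption, and the five-label toy data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Fit on a flat (1-D) array of class labels, similar in shape to
# classes=np.array(list(range(1, 5975))) in the question (toy size here).
lb = LabelBinarizer()
lb.fit(np.arange(1, 6))

# A multilabel indicator matrix, like the output of MultiLabelBinarizer.
y_multilabel = np.array([[1, 0, 1, 0, 0],
                         [0, 1, 0, 0, 1]])

try:
    lb.transform(y_multilabel)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

If this is indeed what happens inside partial_fit, the mismatch is between the 1-D classes argument and the 2-D binarized y, not a fault in y_train itself.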

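I just want to know: what is the correct format of the "y" and "classes" parameters of OneVsRestClassifier.partial_fit()?
Should y be a list of lists or the output of MultiLabelBinarizer? Should classes be an array of all labels, or should it be binarized as well?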


The problematic code:

classif = OneVsRestClassifier(estimator=SGDClassifier())

for i,(X_train, y_train) in enumerate(minibatch_iterators):
    classif.partial_fit(X_train,y_train,classes=np.array(list(range(1,5975))))

Here is the data format.
X:
[image: input data]

y:
Raw format (list of lists):

[[1006,1093,2109,2539,3104,351,3558,5077,5827],
[1076,263,3156,324,405,4079,4707,5560,730],
[5325],
[1077,3755,3863,4256],
[2883],...]

After MultiLabelBinarizer, each row of the array contains a few 1s:
[image: binarized labels]
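
As a rough illustration of that binarized form, the sketch below runs the five raw label lists shown above through MultiLabelBinarizer with the same classes=range(1, 5975) used in the full code. The variable names raw_labels and y_bin are introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# The first five label lists from the raw format above.
raw_labels = [
    [1006, 1093, 2109, 2539, 3104, 351, 3558, 5077, 5827],
    [1076, 263, 3156, 324, 405, 4079, 4707, 5560, 730],
    [5325],
    [1077, 3755, 3863, 4256],
    [2883],
]

# Fixing classes up front keeps the column layout identical for every batch.
mlb = MultiLabelBinarizer(classes=list(range(1, 5975)))
y_bin = mlb.fit_transform(raw_labels)

print(y_bin.shape)                   # (5, 5974)
print(y_bin.sum(axis=1))             # [9 9 1 4 1]
# With classes 1..5974, label k sits in column k - 1.
print(np.flatnonzero(y_bin[2]) + 1)  # [5325]
```

Note that range(1, 5975) produces 5974 columns (labels 1 through 5974), one fewer than the 5975 labels mentioned in the question.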


The complete Python code is as follows:

import csv
import pandas as pd
import numpy as np
#from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import SGDClassifier

def generate_input_label_list(csvPath):
    # Read a CSV of labels and return a list of sorted integer label lists.
    print("reading input labels:", csvPath)
    with open(csvPath, 'rt') as f:
        reader = csv.reader(f)
        labellist = list(reader)  # each row is a list of label strings
    intLabelList = []  # label lists after conversion to int
    for row in labellist:
        temp = sorted(int(j) for j in row)  # convert to int and sort each multilabel
        intLabelList.append(temp)
    print("finish reading input labels!")
    return intLabelList

mlb = MultiLabelBinarizer(classes=[x for x in range(1,5975)])
alllabel=generate_input_label_list("target_train.csv")
all_y=mlb.fit_transform(alllabel)

def iter_minibatch(X_path,y_path,minibatch_size=45):
    # Data batch generator:
    # yields one batch of data and binarized labels at a time (default size 45).
    reader = pd.read_csv(X_path,header=None,chunksize=minibatch_size,iterator=True)
    labelList = generate_input_label_list(y_path)
    for i in range(0,12589):
        X_batch=reader.get_chunk()
        X=X_batch.values
        y_list=labelList[i*minibatch_size:i*minibatch_size+minibatch_size]
        y=mlb.fit_transform(y_list)
        yield X,y

minibatch_iterators=iter_minibatch("data_train.csv","target_train.csv")
# Hold out the first batch as the test set.
X_test,y_test=next(minibatch_iterators)

classif = OneVsRestClassifier(estimator=SGDClassifier())

for i,(X_train, y_train) in enumerate(minibatch_iterators):
    classif.partial_fit(X_train,y_train,classes=np.array(list(range(1,5975))))
    #classif.fit(X_train,y_train)
    print("time:",i)
    print(X_train)
    print(y_train)
    print("score:",classif.score(X_test,y_test))
