Python scikit-learn: how do I build a model for multi-class and multi-label data?

Posted 2024-10-02 14:20:19


I have a dataset like this:

Description  attributes.occasion.0 attributes.occasion.1    attributes.occasion.2   attributes.occasion.3   attributes.occasion.4

 descr01        Chanukah                Christmas               Housewarming        Just Because                Thank You
 descr02        Anniversary             Birthday                Christmas           Graduation                  Mother's Day
 descr03        Chanukah                Christmas               Housewarming        Just Because                Thank You
 descr04        Baby Shower             Birthday                Cinco de Mayo       Gametime                    Just Because
 descr05        Anniversary             Birthday                Christmas           Graduation                  Mother's Day

descr01 => a description of the occasion (I have used short names here; the actual dataset contains the full text description), and so on.

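In the dataset above, I have one independent variable, which holds the text description, and five dependent variables (attributes.occasion.0 through attributes.occasion.4).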

I tried a random forest classifier, which accepts multiple dependent variables as its target.


Here is the code I tried:

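## Imports
import joblib
import scipy.sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

## Split the dataset
# X is kept as a one-column DataFrame so the helpers below can index it by column name.
X_train, X_test, y_train, y_test = train_test_split(
    df[['Description']],
    df[['attributes.occasion.0', 'attributes.occasion.1', 'attributes.occasion.2',
        'attributes.occasion.3', 'attributes.occasion.4']],
    test_size=0.3, random_state=0)

## Apply the model
# alternate_sign=False replaces non_negative=True, which newer scikit-learn releases no longer accept.
tfidf = Pipeline([
    ('vect', HashingVectorizer(ngram_range=(1, 7), alternate_sign=False)),
    ('tfidf', TfidfTransformer()),
])

def feature_combine(dataset):
    # Fit the TF-IDF pipeline on each retained text column and stack the results side by side.
    # cols_to_retain is assumed to be defined earlier in the script (not shown in this post).
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.fit_transform(dataset[col].astype(str)))

    joblib.dump(tfidf, "tfidf.sav")
    return scipy.sparse.hstack(Xall)

def test_Data_text_transform_and_combine(dataset):
    # Apply the already-fitted pipeline to the test data (transform only, no refitting).
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.transform(dataset[col].astype(str)))

    return scipy.sparse.hstack(Xall)

text_clf = RandomForestClassifier()
_ = text_clf.fit(feature_combine(X_train), y_train)

RF_predicted = text_clf.predict(test_Data_text_transform_and_combine(X_test))

# Per-column accuracy in percent (one value per dependent variable)
(y_test == RF_predicted).mean() * 100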

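When I compute the accuracy, I get the output below. But I don't know how to interpret this result, or how to plot a confusion matrix and other performance metrics.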

Output:

Accuracy for each dependent variable

attributes.occasion.0    87.517672
attributes.occasion.1    96.050306
attributes.occasion.2    98.362394
attributes.occasion.3    99.184142
attributes.occasion.4    99.564090
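
For reference, the per-column figures above can be reproduced on a toy example. The sketch below is only an illustration under assumptions: `y_true` and `y_pred` are hypothetical stand-ins for `y_test` and `RF_predicted`, with invented values and only two target columns. It shows per-column accuracy, the stricter exact-match accuracy (a row counts only if every column is correct), and one confusion matrix per target column via `sklearn.metrics.confusion_matrix`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical stand-ins for y_test and RF_predicted (two target columns only).
y_true = pd.DataFrame({
    "attributes.occasion.0": ["Chanukah", "Anniversary", "Chanukah", "Baby Shower"],
    "attributes.occasion.1": ["Christmas", "Birthday", "Christmas", "Birthday"],
})
y_pred = np.array([
    ["Chanukah", "Christmas"],
    ["Anniversary", "Birthday"],
    ["Anniversary", "Christmas"],   # first column is wrong in this row
    ["Baby Shower", "Birthday"],
], dtype=object)

# Boolean array of shape (n_samples, n_columns): True where a prediction matches.
matches = y_true.to_numpy() == y_pred

# Per-column accuracy in percent, the same kind of number as in the output above.
per_column_accuracy = pd.Series(matches.mean(axis=0) * 100, index=y_true.columns)

# Exact-match accuracy: a row is correct only if all of its columns are correct.
exact_match_accuracy = matches.all(axis=1).mean() * 100

# One confusion matrix per target column (rows: true class, columns: predicted class).
confusion_by_column = {}
for i, col in enumerate(y_true.columns):
    labels = sorted(set(y_true[col]) | set(y_pred[:, i]))
    cm = confusion_matrix(y_true[col], y_pred[:, i], labels=labels)
    confusion_by_column[col] = pd.DataFrame(cm, index=labels, columns=labels)

print(per_column_accuracy)
print("Exact-match accuracy:", exact_match_accuracy)
print(confusion_by_column["attributes.occasion.0"])
```

Read this way, a high per-column accuracy says little on its own when one class dominates a column, which is why the per-column confusion matrices are worth inspecting alongside it.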

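Can anyone tell me how to handle a multi-label problem and how to evaluate the model's performance? What approaches are possible in this case? I am using the Python scikit-learn library.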
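
One possible framing, sketched here under stated assumptions rather than as a definitive answer: treat the occasion columns as an unordered set of labels per row, encode that set with `MultiLabelBinarizer`, and evaluate with multi-label metrics such as `hamming_loss` and micro-averaged `f1_score`. The description texts below are invented placeholders, only three occasion columns are used, `TfidfVectorizer` is swapped in for the hashing pipeline from the question, and the model is scored on its own training rows purely to keep the example short.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, hamming_loss
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical miniature of the dataset; the description texts are placeholders.
df = pd.DataFrame({
    "Description": [
        "festive candle gift set for winter holidays",
        "engraved photo frame for a special milestone",
        "holiday candle and cookie gift basket",
        "party snack tray for game night",
    ],
    "attributes.occasion.0": ["Chanukah", "Anniversary", "Chanukah", "Baby Shower"],
    "attributes.occasion.1": ["Christmas", "Birthday", "Christmas", "Birthday"],
    "attributes.occasion.2": ["Housewarming", "Christmas", "Housewarming", "Cinco de Mayo"],
})
occasion_cols = [c for c in df.columns if c.startswith("attributes.occasion.")]

# Collapse the positional columns into one set of labels per row.
label_sets = [
    {value for value in row if pd.notna(value)}
    for row in df[occasion_cols].to_numpy()
]

# Binary indicator matrix: one column per distinct occasion, 1 if the row has it.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# RandomForestClassifier accepts a 2-D indicator matrix as a multi-label target.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(df["Description"], Y)

# Scored on the training rows only to keep the sketch short; use a held-out split in practice.
Y_pred = model.predict(df["Description"])

loss = hamming_loss(Y, Y_pred)                                   # fraction of wrong label slots
micro_f1 = f1_score(Y, Y_pred, average="micro", zero_division=0)
predicted_sets = mlb.inverse_transform(Y_pred)                   # back to occasion names

print(list(mlb.classes_))
print("Hamming loss:", loss)
print("Micro F1:", micro_f1)
```

With this encoding the position of an occasion within the five columns no longer matters, so a prediction is not penalized for listing the right occasions in a different order.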

Thanks, Niranjan

