Using a dataset in sklearn

Posted 2024-09-24 22:27:56

I have a dataset in .csv format:

id,interaction_flag,x_coordinate,y_coordinate,z_coordinate,hydrophobicity_kd,hydrophobicity_ww,hydrophobicity_hh,surface_tension,charge_cooh,charge_nh3,charge_r,alpha_helix,beta_strand,turn,van_der_walls,mol_wt,solublity  
229810,1,-33.8675148907451,-110.273691995647,100.021824089754,0.129381338742408,0.129381338742408,0.129381338742408,57.9996957403639,2.20539553752535,9.55985801217038,4.47146044624688,1.08064908722114,1.20135902636915,0.611653144016251,145.232251521298,107.951643002026,21.5344036511141        
229811,1,-26.9070290467923,-117.172163712053,106.980243932766,0.922048681541592,0.922048681541592,0.922048681541592,58.5383367139972,2.03983772819472,9.23210953346856,1.58401622717997,0.84178498985806,1.0387626774848,0.921703853955354,124.73630831643,84.1570182555755,10.7648600405665

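I am trying to compute the receiver operating characteristic (ROC) for this data by following this example: http://scikit-learn.org/0.11/auto_examples/plot_roc.html

My target is the interaction_flag column, and the features are all the columns after interaction_flag. However, my program keeps running and never finishes.

When I run the test example given at that link, it finishes almost instantly.

Can anyone tell me what I am doing wrong? Or do I need to load my data in some other way?

My code: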

^{pr2}$

My .csv file is at: http://pastebin.com/iet5xQW2 How would I plot the ROC with this .csv?


1 answer
User
#1 · Posted 2024-09-24 22:27:56

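You need two distinct labels to plot a ROC curve. If I add some 0 labels to your data, the example below works for me. I used pandas to read the data; the rest is the same as the sklearn example.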
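
One quick way to confirm the point about labels before fitting anything is to count the distinct values in interaction_flag. The sketch below is only an illustration: `check_binary_labels` is a hypothetical helper name, not part of sklearn, and the label lists are typed in by hand rather than read from the CSV.

```python
from collections import Counter


def check_binary_labels(labels):
    """Return the label counts, raising if fewer than two classes are present."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(
            "ROC needs both classes, found only: %s" % sorted(counts)
        )
    return counts


# The two sample rows shown in the question both carry interaction_flag = 1.
sample_flags = [1, 1]
try:
    check_binary_labels(sample_flags)
except ValueError as err:
    print(err)

# Once at least one 0 label is present, the check passes.
print(check_binary_labels([1, 1, 0]))
```

Running the same check on the real interaction_flag column would show whether any 0 labels are present at all.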

In addition, you need to split the dataset into a training set and a test set, so that the ROC curve is plotted on the test set.
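
As a minimal sketch of a single train/test split, the snippet below fits on one part of the data and computes the ROC only on the held-out part. It assumes a recent scikit-learn where `train_test_split` lives in `sklearn.model_selection`, and it uses synthetic data as a stand-in for the CSV; the 30% test size and the feature count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the CSV: 200 rows, 5 features, labels 0 and 1.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

# Hold out 30% of the rows; stratify keeps both labels in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = SVC(kernel="linear", probability=True, random_state=0)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (label 1).
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)
print("test-set AUC: %.2f" % roc_auc)
```

Cross-validation, as used in the full example, repeats this idea over several splits so that every row is scored exactly once as test data.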

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import svm
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold


def data(filename):
    # Column 0 is the id, column 1 the label, the remaining columns are features.
    df = pd.read_csv(filename, low_memory=False)
    X = np.asarray(df)

    features = X[:, 2:]
    labels = X[:, 1].astype(int)
    # ROC analysis needs two distinct labels; print them as a sanity check.
    print(np.unique(labels))

    return features, labels


filename = '../data/sodata.csv'
X, y = data(filename)

###############################################################################
# Classification and ROC analysis

# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True, random_state=0)

mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

for i, (train, test) in enumerate(cv.split(X, y)):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute the ROC curve and the area under the curve for this fold
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # Interpolate the fold's TPR onto the shared FPR grid before averaging
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

# Diagonal reference line: the ROC of a random classifier
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')

mean_tpr /= cv.get_n_splits()
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
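
The averaging step in the code above may be easier to follow on a tiny example. Each fold produces its ROC points at different FPR values, so every fold's TPR is first interpolated onto the shared `mean_fpr` grid and only then averaged. The two fold curves below are made-up numbers chosen for illustration, not results from the question's data.

```python
import numpy as np

# Two hypothetical per-fold ROC curves as (fpr, tpr) pairs.
fold_curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.8, 1.0])),
    (np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.6, 1.0])),
]

# Shared grid of FPR values: 0, 0.25, 0.5, 0.75, 1.
mean_fpr = np.linspace(0, 1, 5)
mean_tpr = np.zeros_like(mean_fpr)

for fpr, tpr in fold_curves:
    # Resample this fold's TPR at the shared FPR grid points.
    mean_tpr += np.interp(mean_fpr, fpr, tpr)

mean_tpr /= len(fold_curves)
print(mean_tpr)
```

Because `np.interp` expects increasing x values, this relies on `fpr` being sorted in increasing order, which is how `roc_curve` returns it.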
