Training a classifier with Weka in Java is too slow
I am building a classifier with Weka. My dataset is sparse (text data), and I construct the feature vectors myself instead of using Weka's utility classes to convert text documents into feature vectors. The problem is that training any classifier is very slow, even though the number of features and samples is small.

I wrote a test case with artificial sparse feature vectors to show you how slow it is. You can run it:
```java
import java.util.Date;
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.functions.SimpleLogistic;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SparseInstance;

public static void test() throws Exception {
    System.out.println("Started test ... " + new Date());
    Classifier clf = new SimpleLogistic();

    int numberOfFeatures = 2000;
    int numberOfSamples = 6000;
    Random rnd = new Random(0);

    // Define the dataset: one numeric attribute per feature plus a nominal class
    FastVector attributes = new FastVector(numberOfFeatures + 1);
    for (int i = 0; i < numberOfFeatures; i++) {
        attributes.addElement(new Attribute(Integer.toString(i)));
    }
    FastVector classes = new FastVector(2);
    classes.addElement("Positive");
    classes.addElement("Negative");
    attributes.addElement(new Attribute("class", classes));

    Instances data = new Instances("", attributes, 100);
    data.setClassIndex(data.numAttributes() - 1);

    // Create artificial sparse feature vectors for the positive class
    for (int i = 0; i < numberOfSamples / 2; i++) {
        double[] vec = new double[numberOfFeatures + 1];
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        Instance instance = new Instance(1.0, vec);
        instance.setDataset(data);
        Instance sparseInstance = new SparseInstance(instance);
        sparseInstance.setDataset(data);
        sparseInstance.setClassValue("Positive");
        data.add(sparseInstance);
    }

    // Create artificial sparse feature vectors for the negative class
    for (int i = 0; i < numberOfSamples / 2; i++) {
        double[] vec = new double[numberOfFeatures + 1];
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        vec[rnd.nextInt(numberOfFeatures)] = 1;
        Instance instance = new Instance(1.0, vec);
        instance.setDataset(data);
        Instance sparseInstance = new SparseInstance(instance);
        sparseInstance.setDataset(data);
        sparseInstance.setClassValue("Negative");
        data.add(sparseInstance);
    }

    System.out.println("Building classifier ... ");
    clf.buildClassifier(data);
    System.out.println(new Date());
}
```
I'm not sure if there is something I should be doing to make this faster. It doesn't make sense to me, since gradient descent should run quickly. I also tried a MultilayerPerceptron classifier with one hidden layer, one hidden unit, and one epoch, and it was still very slow.
Edit:

I tried the same idea as in the test case, but with scikit-learn, and it is very fast. Here it is:
```python
import numpy as np
import random
from sklearn import linear_model

numberOfFeatures = 2000
numberOfSamples = 6000

X = np.zeros((numberOfSamples, numberOfFeatures))
y = np.zeros(numberOfSamples)
for i in xrange(numberOfSamples):
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
    X[i][random.randint(0, numberOfFeatures - 1)] = 1
for i in xrange(100):
    y[i] = 1

clf = linear_model.LogisticRegression()
print 'fitting'
clf.fit(X, y)
print 'done!'
```
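Since the data are sparse (at most 8 nonzeros per row out of 2000 columns), one could also skip the dense `np.zeros` array entirely and hand scikit-learn a SciPy sparse matrix; `LogisticRegression.fit` accepts CSR input directly. A minimal sketch of that idea, assuming `scipy` is installed (Python 3 syntax, unlike the snippet above):

```python
import random

import numpy as np
from scipy.sparse import lil_matrix
from sklearn.linear_model import LogisticRegression

random.seed(0)
numberOfSamples, numberOfFeatures = 6000, 2000

# Build the same kind of artificial data, but store only the nonzeros.
X = lil_matrix((numberOfSamples, numberOfFeatures))
for i in range(numberOfSamples):
    for _ in range(8):
        X[i, random.randint(0, numberOfFeatures - 1)] = 1
X = X.tocsr()  # CSR layout gives fast row-wise access during training

y = np.zeros(numberOfSamples)
y[:100] = 1  # same labeling as above: first 100 samples are positive

clf = LogisticRegression()
clf.fit(X, y)  # fits directly on the sparse matrix, no dense copy
print(X.nnz, "stored values instead of", numberOfSamples * numberOfFeatures)
```

This stores on the order of 48k values instead of 12 million, which also keeps memory flat as the vocabulary grows.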