Machine learning with Weka: how do I predict new, unseen instances from Java code?
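I wrote WEKA Java code to train 4 classifiers. I saved the classifier models and want to use them to predict new, unseen instances (think of someone who wants to test whether a tweet is positive or negative).
I applied the StringToWordVector filter to the training data. To avoid the "Src and Dest differ in # of attributes" error, I used the code below to train the filter on the training data and then apply it to the new instance, to try to predict whether the new instance is positive or negative. I just can't get it to work.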
Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //reading one of the trained classifiers
BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data);
Instances filteredData = Filter.useFilter(data, filter);
// rebuild classifier
cls.buildClassifier(filteredData);
String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));
// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("Negative");
classValues.addElement("Positive");
attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test istance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);
Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
test.setDataset(tests);
Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests);
Instances filteredTests = Filter.useFilter(tests, filter2);
System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set
ArffSaver saver = new ArffSaver(); //save test data to ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();
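When I try to standardize the input data using the filtered training data, I get a new ARFF file (unseen.ARFF), but it contains 2000 instances (the same number as the training data), and most of the values are negative. I don't understand why, or how to remove those instances.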
System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);
Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
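I want the evaluation output to show, for example, the percentage of how positive or negative this instance is, but I get the following instead. I also expected to see 1 instance, not 2000. Any help on how to do this would be greatly appreciated.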
> Results
======
Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
Thanks
# Answer 1
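Use eval.predictions(). It is a java.util.ArrayList<Prediction>. You can then inspect each Prediction: weight() returns the instance weight, and for a nominal class each entry is a NominalPrediction whose distribution() shows how positive or negative your test instance is.
# Answer 2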
cls.distributionForInstance(newInst) returns the probability distribution for the instance. Check the docs.
# Answer 3
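I have found a good solution, and I am sharing my code here. It trains a classifier using WEKA Java code and then uses it to predict new, unseen instances. Some parts (such as paths) are hardcoded, but you can easily modify the method to take parameters.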
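To illustrate why the train-then-predict flow only works when the word dictionary is learned once from the training data and then reused for the unseen text, here is a minimal plain-Java sketch. It is not WEKA code and not the code from this answer; the class SharedVocabularySketch and its fit and transform methods are hypothetical names chosen for the illustration, and the two training sentences are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; it is not part of WEKA.
class SharedVocabularySketch {

    // Word -> column index, learned from the training texts only.
    private final Map<String, Integer> vocabulary = new LinkedHashMap<>();

    // Learn the dictionary once, from the training texts.
    void fit(List<String> trainingTexts) {
        for (String text : trainingTexts) {
            for (String token : tokenize(text)) {
                if (!token.isEmpty()) {
                    vocabulary.putIfAbsent(token, vocabulary.size());
                }
            }
        }
    }

    // Map any text onto the training columns; words outside the dictionary are ignored,
    // so the result always has as many columns as the training data.
    int[] transform(String text) {
        int[] counts = new int[vocabulary.size()];
        for (String token : tokenize(text)) {
            Integer column = vocabulary.get(token);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    // Lowercase and split on anything that is not a letter, digit, or apostrophe.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+");
    }

    public static void main(String[] args) {
        SharedVocabularySketch vectorizer = new SharedVocabularySketch();
        vectorizer.fit(Arrays.asList("I love this phone", "I hate this movie"));

        // The unseen text contains "brand" and "new", which were never seen in training.
        int[] row = vectorizer.transform("Love love this brand new phone");
        System.out.println(Arrays.toString(row)); // prints [0, 2, 1, 1, 0, 0]
    }
}
```

In WEKA terms this corresponds to calling setInputFormat with the training data once and passing both the training set and the unseen instance through that same StringToWordVector object, or to wrapping the filter and the classifier in a FilteredClassifier. Building a second StringToWordVector from the test text alone produces a different set of attributes, which is the likely source of the "Src and Dest differ in # of attributes" error.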