<p>I agree with @MarcoPashkov's comment, but I will try to elaborate on the LibSVM file format. I find the documentation thorough yet hard to find; for the Python library I recommend the <a href="https://github.com/cjlin1/libsvm/blob/2d6fac385db4f082c634bf47ffb12a6e4f77ce42/python/README" rel="nofollow">README on GitHub</a>.</p>
<p>An important point to recognize is that there is a sparse format, where all zero-valued features are dropped, and a dense format, where zero-valued features are kept. These two are the equivalent examples from the README.</p>
<pre><code># Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
</code></pre>
<p>The <code>y</code> variable stores a list of all the categories for the data.</p>
<p>The <code>x</code> variable stores the feature vectors.</p>
<p><code>assert len(y) == len(x), "Both lists should be the same length"</code></p>
<p>The format found in the <a href="https://github.com/cjlin1/libsvm/blob/2d6fac385db4f082c634bf47ffb12a6e4f77ce42/heart_scale" rel="nofollow">Heart Scale Example</a> is the sparse format, where the dictionary keys are the feature indices and the dictionary values are the feature values, while the first value on each line is the category.</p>
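<p>To make the mapping concrete, here is a minimal sketch (plain Python, no LibSVM required) that converts the dense example from the README above into the on-disk text format used by files like heart_scale. Each line is <code>&lt;label&gt; &lt;index&gt;:&lt;value&gt; ...</code>, with zero-valued features omitted:</p>
<pre><code># Convert the dense README example into LibSVM's on-disk sparse text format.
y, x = [1, -1], [[1, 0, 1], [-1, 0, -1]]

lines = []
for label, vector in zip(y, x):
    # LibSVM feature indices start at 1, and zero values are dropped.
    pairs = ["%d:%g" % (i + 1, v) for i, v in enumerate(vector) if v != 0]
    lines.append(" ".join([str(label)] + pairs))

print("\n".join(lines))
# 1 1:1 3:1
# -1 1:-1 3:-1
</code></pre>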
<p>The sparse format is especially useful when using a <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation" rel="nofollow">Bag of Words Representation</a> for your feature vectors.</p>
<blockquote>
<p>As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).</p>
<p>For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.</p>
</blockquote>
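<p>As a rough illustration of why the sparse dictionary format fits a bag-of-words representation, the sketch below (the <code>to_sparse</code> helper and the tiny vocabulary are made up for this example) counts word occurrences and keeps only the nonzero features:</p>
<pre><code># word -> LibSVM feature index (indices start at 1)
vocabulary = {"spam": 1, "ham": 2, "eggs": 3, "toast": 4}

def to_sparse(document, vocabulary):
    # Count word occurrences, keeping only the words that actually appear.
    counts = {}
    for word in document.split():
        index = vocabulary.get(word)
        if index is not None:
            counts[index] = counts.get(index, 0) + 1
    return counts

docs = ["spam spam eggs", "ham toast"]
x = [to_sparse(d, vocabulary) for d in docs]
print(x)
# [{1: 2, 3: 1}, {2: 1, 4: 1}]
</code></pre>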
<p>As an example using the feature vectors you started with, I trained a basic LibSVM 3.20 model. This code is not intended for production use, but it may help show how to create and test a model.</p>
<pre><code>from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])
# Set up the categories; libsvm requires a numerical index, so we associate each name with one.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects indices to start at 1, not 0.
    categories[name] = Category(index + 1, name)
categories
Out[0]: {'B': Category(index=1, name='B'),
'C': Category(index=3, name='C'),
'M': Category(index=2, name='M'),
'NA': Category(index=5, name='NA'),
'S': Category(index=4, name='S')}
# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]
# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))
features
Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
# Y is the category index used in training for each Feature; a list (order matters) of all the trained indexes.
y = [f.category_index for f in features]
# X is the feature vector; we take all of the namedtuple's values except the category, which is at index 0.
x = [list(f)[1:] for f in features]
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model = svm_train(prob, param)
# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)
Out[3]: Accuracy = 100% (5/5) (classification)
</code></pre>
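<p>The last comment in the code above points out that a real accuracy check needs a held-out test set. Here is a minimal sketch of an 80/20 split over the parallel <code>y</code>/<code>x</code> lists (the <code>train_test_split</code> helper is made up for illustration and uses only the standard library):</p>
<pre><code>import random

def train_test_split(y, x, test_fraction=0.2, seed=0):
    # Shuffle indices so the split is random but reproducible via the seed.
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([y[i] for i in train], [x[i] for i in train],
            [y[i] for i in test], [x[i] for i in test])

y = [1, 2, 3, 4, 5]
x = [[1, 10, 1, 0], [10, 1, 0, 1], [2, 3, 0, 1], [23, 2, 0, 0], [12, 0, 0, 1]]
y_train, x_train, y_test, x_test = train_test_split(y, x)
# svm_train would then use (y_train, x_train) and svm_predict (y_test, x_test).
</code></pre>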
<p>I hope this example proves useful; it should not be used for your actual training. It is meant only as an example, since it is inefficient.</p>