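<p>Below is a step-by-step guide to training a support vector machine (SVM) on your data and then evaluating it on the same dataset. It is also available at <a href="http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f" rel="nofollow noreferrer">http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f</a>. At that URL you can also see the output of the intermediate data and the resulting accuracy (it is an <a href="http://ipython.org/notebook.html" rel="nofollow noreferrer">iPython notebook</a>).</p>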
<h3>Step 0: Install dependencies</h3>
<p>You need to install the following libraries:</p>
<ul>
<li>pandas</li>
<li>scikit-learn</li>
</ul>
<p>From the command line:</p>
<pre><code>pip install pandas
pip install scikit-learn
</code></pre>
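<h3>Step 1: Load the data</h3>
<p>We will use pandas to load our data.
pandas is a library that makes loading data easy. To illustrate, we first save
some sample data to a CSV file and then load it.</p>
<p>We will train the SVM with <code>train.csv</code> and get the test labels from <code>test.csv</code>.</p>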
<pre><code>import pandas as pd

# Sample training data: one class label followed by four numeric features.
train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""

# Save the sample to disk, then load it back the way you would load real data.
with open('train.csv', 'w') as output:
    output.write(train_data_contents)
train_dataframe = pd.read_csv('train.csv')
</code></pre>
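<p>If you only want to experiment, writing a file first is optional. The following is a minimal sketch, assuming a shortened two-row version of the sample data, that reads the CSV text directly from memory and checks what pandas loaded. The names <code>csv_text</code> and <code>frame</code> are illustrative and not part of the original notebook.</p>

```python
import io

import pandas as pd

# A shortened, illustrative copy of the sample data (two rows only).
csv_text = """class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
"""

# read_csv accepts any file-like object, so StringIO avoids touching the disk.
frame = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: row/column counts, column names, and inferred dtypes.
print(frame.shape)
print(list(frame.columns))
print(frame.dtypes)
```

<p>Checking the shape and dtypes right after loading is a cheap way to catch a wrong delimiter or a column that was read as text instead of numbers.</p>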
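<h3>Step 2: Process the data</h3>
<p>We will convert the dataframe into numpy arrays, which is a format that
scikit-learn understands.</p>
<p>We also need to convert the labels "B", "M", "C", ... to integers, because the SVM
does not understand strings.</p>
<p>Then we will train an SVM on the data.</p>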
<pre><code>import numpy as np

# Map each distinct class label to an integer index (sorted for a stable order).
train_labels = train_dataframe.class_label
labels = sorted(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])

# Every column after the first is a feature.
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print("train labels: ")
print(train_labels)
print()
print("train features:")
print(train_features)
</code></pre>
<p>We see here that the length of <code>train_labels</code> (5) exactly matches the number of rows
in <code>train_features</code>. Each item in <code>train_labels</code> corresponds to one row.</p>
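<p>As an alternative to the manual <code>labels.index(x)</code> mapping, scikit-learn ships a <code>LabelEncoder</code> that does the same job and can also map integers back to the original strings. This is a sketch of that swapped-in approach, not what the notebook does; <code>raw_labels</code> is simply the label column from the sample data typed out by hand.</p>

```python
from sklearn.preprocessing import LabelEncoder

# The class labels from the sample training data, in row order.
raw_labels = ["B", "M", "C", "S", "N"]

# fit_transform() learns the sorted set of classes and maps each label
# to its position in that sorted set.
encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)

# inverse_transform() turns integers (e.g. predictions) back into strings.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

<p>Keeping the fitted encoder around means the test labels can be encoded with <code>encoder.transform(...)</code>, which guarantees the same mapping as the training labels.</p>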
<h3>Step 3: Train the SVM</h3>
<pre><code>from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)
</code></pre>
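<p>Note that <code>svm.SVC()</code> with no arguments uses an RBF kernel. If a linear SVM is what you are after, one option is to pass <code>kernel="linear"</code> explicitly. The sketch below assumes the five sample rows typed out as arrays, with labels already encoded as integers; the variable names are illustrative.</p>

```python
import numpy as np
from sklearn import svm

# The five sample training rows, with labels already encoded as integers.
features = np.array([
    [1, 10, 1, 0],
    [10, 1, 0, 1],
    [2, 3, 0, 1],
    [23, 2, 0, 0],
    [12, 0, 0, 1],
])
labels = np.array([0, 1, 2, 3, 4])

# The default kernel is RBF; request a linear kernel explicitly instead.
default_classifier = svm.SVC()
linear_classifier = svm.SVC(kernel="linear")
linear_classifier.fit(features, labels)

print(default_classifier.kernel, linear_classifier.kernel)
print(linear_classifier.predict(features))
```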
<h3>Step 4: Evaluate the SVM on some test data</h3>
<pre><code>test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""
with open('test.csv', 'w') as output:
    output.write(test_data_contents)
test_dataframe = pd.read_csv('test.csv')

# Reuse the label-to-integer mapping built from the training data in Step 2,
# so the test labels are encoded the same way the classifier was trained.
test_labels = test_dataframe.class_label
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

# Predict, then compute the fraction of predictions that match the true labels.
results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
accuracy = num_correct / len(test_labels)
print("model accuracy (%): ", accuracy * 100, "%")
</code></pre>
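<p>The fraction of correct predictions computed above is what scikit-learn calls accuracy, and <code>sklearn.metrics.accuracy_score</code> computes the same number. The sketch below uses made-up predictions purely to show that the two calculations agree; the arrays are not output from the model in this answer.</p>

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
true_labels = np.array([0, 1, 2, 3, 4])
predicted = np.array([0, 1, 2, 3, 0])  # the last prediction is wrong

# Manual calculation: matching predictions divided by total predictions.
manual = (predicted == true_labels).sum() / len(true_labels)

# The same metric from scikit-learn.
builtin = accuracy_score(true_labels, predicted)

print(manual, builtin)
```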
<h3>Links and tips</h3>
<ul>
<li>Example code showing how to load LinearSVC: <a href="http://scikit-learn.org/stable/modules/svm.html#svm" rel="nofollow noreferrer">http://scikit-learn.org/stable/modules/svm.html#svm</a></li>
<li>A long list of scikit-learn examples: <a href="http://scikit-learn.org/stable/auto_examples/index.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/auto_examples/index.html</a>. I have found these somewhat helpful, but
often confusing.</li>
<li>If you find that the SVM takes a long time to train, try LinearSVC
instead: <a href="http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html</a></li>
<li>Here is another tutorial for getting familiar with machine learning models: <a href="http://scikit-learn.org/stable/tutorial/basic/tutorial.html" rel="nofollow noreferrer">http://scikit-learn.org/stable/tutorial/basic/tutorial.html</a></li>
</ul>
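<p>For the LinearSVC tip, swapping the classifier is close to a one-line change. The following is a rough sketch, assuming the five sample rows typed out as arrays and a raised <code>max_iter</code> (my own choice, to give the solver more room on unscaled features); on data this small it may still print a convergence warning.</p>

```python
import numpy as np
from sklearn.svm import LinearSVC

# The five sample training rows, with labels already encoded as integers.
features = np.array([
    [1, 10, 1, 0],
    [10, 1, 0, 1],
    [2, 3, 0, 1],
    [23, 2, 0, 0],
    [12, 0, 0, 1],
])
labels = np.array([0, 1, 2, 3, 4])

# LinearSVC exposes the same fit/predict interface as svm.SVC.
linear_classifier = LinearSVC(max_iter=10000)
linear_classifier.fit(features, labels)
predictions = linear_classifier.predict(features)

print(predictions)
print(linear_classifier.coef_.shape)  # one weight vector per class
```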
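<p>You should be able to take this code, replace <code>train.csv</code> with your training data and <code>test.csv</code> with your test data, and get predictions for your test data along with accuracy results.</p>
<p>Note that because you are evaluating on the same data you trained on, the measured accuracy will be unusually high.</p>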
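<p>To get a more honest estimate, one common approach is to hold out part of the data before training. The sketch below uses scikit-learn's <code>train_test_split</code> on synthetic, made-up data (two random clusters), since the five-row sample is too small to split meaningfully; the 25% hold-out size and the variable names are assumptions for illustration.</p>

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic, made-up data: two clusters of 20 rows with 4 features each.
rng = np.random.RandomState(0)
features = np.vstack([
    rng.normal(0, 1, (20, 4)),
    rng.normal(5, 1, (20, 4)),
])
labels = np.array([0] * 20 + [1] * 20)

# Hold out 25% of the rows; the classifier never sees them during fit().
train_X, test_X, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

classifier = svm.SVC()
classifier.fit(train_X, train_y)

# score() returns the accuracy on the rows that were held out.
held_out_accuracy = classifier.score(test_X, test_y)
print("held-out accuracy:", held_out_accuracy)
```

<p>With your own data, you would pass your feature array and label array in place of the synthetic ones.</p>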