<p>I agree with @MarcoPashkov's comment, but I will try to elaborate on the LibSVM file format. I find the documentation thorough yet hard to find; for the Python library I recommend the <a href="https://github.com/cjlin1/libsvm/blob/2d6fac385db4f082c634bf47ffb12a6e4f77ce42/python/README" rel="nofollow">README on GitHub</a>.</p>
<p>An important point to recognize is that there is a sparse format, where all zero-valued features are dropped, and a dense format, where zero-valued features are kept. These two are the equivalent examples from the README.</p>
<pre><code># Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]
</code></pre>
<p>The <code>y</code> variable stores a list of all the categories for the data.</p>
<p>The <code>x</code> variable stores the feature vectors.</p>
<p><code>assert len(y) == len(x), "Both lists should be the same length"</code></p>
<p>The format found in the <a href="https://github.com/cjlin1/libsvm/blob/2d6fac385db4f082c634bf47ffb12a6e4f77ce42/heart_scale" rel="nofollow">Heart Scale Example</a> is the sparse format, where the dictionary keys are the feature indices and the dictionary values are the feature values, while the first value on each line is the category.</p>
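<p>To make the mapping concrete, here is a minimal sketch (plain Python, no LibSVM required) that converts the dense example from the README above into the on-disk text format used by files like heart_scale. Each line is <code>&lt;label&gt; &lt;index&gt;:&lt;value&gt; ...</code>, with zero-valued features omitted:</p>
<pre><code># Convert the dense README example into LibSVM's on-disk sparse text format.
y, x = [1, -1], [[1, 0, 1], [-1, 0, -1]]

lines = []
for label, vector in zip(y, x):
    # LibSVM feature indices start at 1, and zero values are dropped.
    pairs = ["%d:%g" % (i + 1, v) for i, v in enumerate(vector) if v != 0]
    lines.append(" ".join([str(label)] + pairs))

print("\n".join(lines))
# 1 1:1 3:1
# -1 1:-1 3:-1
</code></pre>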
<p>The sparse format is especially useful when using a <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation" rel="nofollow">Bag of Words Representation</a> for your feature vectors.</p>
<blockquote>
<p>As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).</p>
<p>For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.</p>
</blockquote>
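<p>As a rough illustration of why the sparse dictionary format fits a bag-of-words representation, the sketch below (the <code>to_sparse</code> helper and the tiny vocabulary are made up for this example) counts word occurrences and keeps only the nonzero features:</p>
<pre><code># word -> LibSVM feature index (indices start at 1)
vocabulary = {"spam": 1, "ham": 2, "eggs": 3, "toast": 4}

def to_sparse(document, vocabulary):
    # Count word occurrences, keeping only the words that actually appear.
    counts = {}
    for word in document.split():
        index = vocabulary.get(word)
        if index is not None:
            counts[index] = counts.get(index, 0) + 1
    return counts

docs = ["spam spam eggs", "ham toast"]
x = [to_sparse(d, vocabulary) for d in docs]
print(x)
# [{1: 2, 3: 1}, {2: 1, 4: 1}]
</code></pre>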
<p>As an example using the feature vectors you started with, I trained a basic LibSVM 3.20 model. This code is not intended for production use, but it may help show how to create and test a model.</p>
<pre><code>from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])
# Set up the categories; libsvm requires a numerical index, so we associate each name with one.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # LibSVM expects indices to start at 1, not 0.
    categories[name] = Category(index + 1, name)
categories
Out[0]: {'B': Category(index=1, name='B'),
'C': Category(index=3, name='C'),
'M': Category(index=2, name='M'),
'NA': Category(index=5, name='NA'),
'S': Category(index=4, name='S')}
# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]
# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))
features
Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]
# Y is the category index used in training for each Feature; a list (order matters) of all the trained indexes.
y = [f.category_index for f in features]
# X is the feature vector; we take all of the namedtuple's values except the category, which is at index 0.
x = [list(f)[1:] for f in features]
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model = svm_train(prob, param)
# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)
Out[3]: Accuracy = 100% (5/5) (classification)
</code></pre>
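<p>The last comment in the code above points out that a real accuracy check needs a held-out test set. Here is a minimal sketch of an 80/20 split over the parallel <code>y</code>/<code>x</code> lists (the <code>train_test_split</code> helper is made up for illustration and uses only the standard library):</p>
<pre><code>import random

def train_test_split(y, x, test_fraction=0.2, seed=0):
    # Shuffle indices so the split is random but reproducible via the seed.
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([y[i] for i in train], [x[i] for i in train],
            [y[i] for i in test], [x[i] for i in test])

y = [1, 2, 3, 4, 5]
x = [[1, 10, 1, 0], [10, 1, 0, 1], [2, 3, 0, 1], [23, 2, 0, 0], [12, 0, 0, 1]]
y_train, x_train, y_test, x_test = train_test_split(y, x)
# svm_train would then use (y_train, x_train) and svm_predict (y_test, x_test).
</code></pre>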
<p>I hope this example proves useful; it should not be used for your actual training. It is meant only as an example, since it is inefficient.</p>