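<p>"Making an object persistent" basically means dumping the binary representation held in memory to a file on your hard drive, so that later in your program's life, or in any other program, the object can be reloaded from that file into memory.</p>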
<p>Either the <code>joblib</code> bundled with scikit-learn or the stdlib <code>pickle</code> and <code>cPickle</code> will do the job.
I tend to prefer <code>cPickle</code> because it is significantly faster. Using <a href="http://ipython.org/ipython-doc/3/interactive/magics.html#magic-timeit" rel="noreferrer">ipython's %timeit command</a>:</p>
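<pre><code>>>> import pickle
>>> import cPickle  # Python 2 only
>>> from sklearn.externals import joblib  # joblib as bundled with older scikit-learn releases
>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world', 'this is a test'])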
# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:     with open('vectorizer.bin', 'wb') as f:  # pickle streams are binary
...:         serializer.dump(tfidf, f)
...:     with open('vectorizer.bin', 'rb') as f:
...:         return serializer.load(f)
# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:     joblib.dump(tfidf, 'tfidf.bin')
...:     return joblib.load('tfidf.bin')
# Now, time it!
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop
>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop
</code></pre>
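<p>On Python 3, <code>cPickle</code> no longer exists and <code>pickle</code> uses its C implementation automatically when it is available, so the comparison above mostly applies to Python 2. The same kind of measurement can be sketched with the stdlib <code>timeit</code> module instead of <code>%timeit</code>. This is only a minimal sketch: the <code>sample</code> dict is a hypothetical stand-in for the fitted vectorizer and the temporary file location is an assumption, so the numbers it prints will not match the timings above.</p>

```python
import os
import pickle
import tempfile
import timeit


def dump_load_test(obj, serializer, path):
    """Write obj to path with serializer.dump, then read it back."""
    # Pickle streams are binary, so the file is opened in 'wb' / 'rb' mode.
    with open(path, 'wb') as f:
        serializer.dump(obj, f)
    with open(path, 'rb') as f:
        return serializer.load(f)


# Hypothetical stand-in for a fitted vectorizer (plain built-in types only).
sample = {'vocabulary': {'hello': 0, 'world': 1}, 'idf': [1.0, 1.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'vectorizer.bin')
    restored = dump_load_test(sample, pickle, path)
    # Total seconds for 1000 dump/load round trips; machine dependent.
    elapsed = timeit.timeit(
        lambda: dump_load_test(sample, pickle, path), number=1000
    )

print('round trip ok:', restored == sample)
print('seconds for 1000 round trips: %.4f' % elapsed)
```

<p>Any object exposing <code>dump(obj, file)</code> and <code>load(file)</code> can be passed as <code>serializer</code>, which is what lets the same helper time several libraries.</p>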
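<p>Now, if you want to store multiple objects in one file, you can easily create a data structure to hold them and then dump the data structure itself. This works with a <code>tuple</code>, a <code>list</code>, or a <code>dict</code>.
Taking the example from your question:</p>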
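<pre><code>import cPickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# train (corpus and y_train come from the question)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)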
# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f:
    cPickle.dump(data_struct, f)
</code></pre>
<p>Later, or in another program, the following statements bring the data structure back into your program's memory:</p>
<pre><code># reload
import cPickle
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
vectorizer, selector = data_struct['vectorizer'], data_struct['selector']
# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)
</code></pre>
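<p>The choice of container mostly affects how the objects are retrieved after loading: a <code>dict</code> gives lookup by name, while a <code>tuple</code> or <code>list</code> is unpacked by position. The sketch below illustrates both on Python 3 with the stdlib <code>pickle</code>. The names <code>vectorizer_state</code> and <code>selector_state</code> are hypothetical plain-data stand-ins for the fitted objects, and the temporary directory is an assumption made so the example leaves no files behind.</p>

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer and selector.
vectorizer_state = {'hello': 0, 'world': 1, 'test': 2}
selector_state = [0, 2]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'storage.bin')

    # dict container: objects are looked up by name after loading.
    with open(path, 'wb') as f:
        pickle.dump({'vectorizer': vectorizer_state,
                     'selector': selector_state}, f)
    with open(path, 'rb') as f:
        as_dict = pickle.load(f)

    # tuple container: objects are unpacked by position after loading.
    with open(path, 'wb') as f:
        pickle.dump((vectorizer_state, selector_state), f)
    with open(path, 'rb') as f:
        vec_loaded, sel_loaded = pickle.load(f)

print(as_dict['vectorizer'] == vectorizer_state)
print(sel_loaded == selector_state)
```

<p>A <code>dict</code> tends to be the more forgiving option, since adding a third object later does not change how existing keys are read.</p>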