How do I store a TfidfVectorizer for future use in scikit-learn?

Posted 2024-09-28 01:33:00

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection.

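from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)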

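Now I want to store this and use it in other programs. I don't want to re-run TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib, but I wonder whether persisting these objects works the same way as persisting a model.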


3 Answers

Here is my answer using joblib:

import joblib

joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')

Later, I can load them and they are ready to use:

vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')

test = selector.transform(vectorizer.transform(['this is test']))
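
As a minimal end-to-end sketch of the joblib approach above: the toy corpus, the labels, the small `k`, and the temporary directory are all assumptions made so the example runs on its own. They are not part of the original question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data standing in for the real corpus and labels (hypothetical).
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "markets rallied after the report",
]
y_train = [0, 0, 1, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
# k must not exceed the vocabulary size; 5 is chosen only for this toy corpus.
selector = SelectKBest(chi2, k=5)
X_train_sel = selector.fit_transform(X_train, y_train)

# Persist both fitted objects (a temp directory keeps the example tidy).
out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vectorizer.pkl")
sel_path = os.path.join(out_dir, "selector.pkl")
joblib.dump(vectorizer, vec_path)
joblib.dump(selector, sel_path)

# Later, or in another program: reload and transform new text.
loaded_vectorizer = joblib.load(vec_path)
loaded_selector = joblib.load(sel_path)
test = loaded_selector.transform(loaded_vectorizer.transform(["this is test"]))
print(test.shape)
```

The reloaded objects are applied with `transform` only, so nothing is refit on the training data.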

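"Making an object persistent" basically means dumping the binary representation of the object held in memory to a file on disk, so that later, in the same program or in any other program, the object can be reloaded from that file into memory.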

Either the joblib that ships with scikit-learn or the stdlib pickle and cPickle modules will do the job. I prefer cPickle because it is noticeably faster. Using IPython's %timeit command:

>>> import pickle, cPickle  # Python 2 session; cPickle does not exist on Python 3
>>> import joblib
>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world'], ['this is a test'])

# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:    with open('vectorizer.bin', 'wb') as f:  # pickle files must be opened in binary mode
...:        serializer.dump(tfidf, f)
...:    with open('vectorizer.bin', 'rb') as f:
...:        return serializer.load(f)

# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:    joblib.dump(tfidf, 'tfidf.bin')
...:    return joblib.load('tfidf.bin')

# Now, time it!
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop

>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop

>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop
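
The session above is from Python 2. On Python 3 there is no separate cPickle module, and the standard pickle module uses its C implementation automatically when available. A rough way to repeat the comparison there, sketched with the stdlib `timeit` module instead of `%timeit`, is shown below. The temporary paths and the repeat count are assumptions, and no timings are shown because they depend on the machine and library versions.

```python
import os
import pickle
import tempfile
import timeit

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

t = TfidfVectorizer()
t.fit(["hello world", "this is a test"])

tmp_dir = tempfile.mkdtemp()
pickle_path = os.path.join(tmp_dir, "vectorizer.bin")
joblib_path = os.path.join(tmp_dir, "tfidf.bin")


def pickle_round_trip(tfidf):
    # Binary mode is required for pickle files.
    with open(pickle_path, "wb") as f:
        pickle.dump(tfidf, f)
    with open(pickle_path, "rb") as f:
        return pickle.load(f)


def joblib_round_trip(tfidf):
    # joblib takes a file path directly instead of an open file object.
    joblib.dump(tfidf, joblib_path)
    return joblib.load(joblib_path)


runs = 100  # arbitrary repeat count for this sketch
pickle_secs = timeit.timeit(lambda: pickle_round_trip(t), number=runs)
joblib_secs = timeit.timeit(lambda: joblib_round_trip(t), number=runs)
print(f"pickle: {pickle_secs / runs * 1e3:.3f} ms per round trip")
print(f"joblib: {joblib_secs / runs * 1e3:.3f} ms per round trip")
```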

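Now, if you want to store several objects in one file, you can simply create a data structure to hold them and dump the data structure itself. This works with a tuple, list, or dict. Taking your question as an example: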

# train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k = 5000 )
X_train_sel = selector.fit_transform(X_train, y_train)

# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f: 
    cPickle.dump(data_struct, f)

Later, or in another program, the following statements restore the data structure into the program's memory:

# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
    vectorizer, selector = data_struct['vectorizer'], data_struct['selector']

# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)
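
For Python 3, where the standard pickle module takes the place of cPickle, the same single-file idea might look like the sketch below. The toy corpus, the labels, `k=4`, and the temporary storage path are assumptions made so the example is self-contained.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data; replace with the real corpus and labels.
corpus = [
    "good movie with great acting",
    "great plot and good pacing",
    "bad movie with terrible acting",
    "terrible plot and bad pacing",
]
y_train = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=4)  # small k because the toy vocabulary is small
selector.fit(X_train, y_train)

storage_path = os.path.join(tempfile.mkdtemp(), "storage.bin")

# Dump both fitted objects into one file as a dict.
with open(storage_path, "wb") as f:
    pickle.dump({"vectorizer": vectorizer, "selector": selector}, f)

# Reload and unpack, as another program would.
with open(storage_path, "rb") as f:
    data_struct = pickle.load(f)
vectorizer2, selector2 = data_struct["vectorizer"], data_struct["selector"]

vectors = vectorizer2.transform(["a good movie"])
vec_sel = selector2.transform(vectors)
print(vec_sel.shape)
```

Keeping the two objects in one file means they are always saved and restored together, which avoids pairing a vectorizer with a selector fitted on a different vocabulary.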

You can simply use the built-in pickle library:

with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("selector.pickle", "wb") as f:
    pickle.dump(selector, f)

And to load them:

with open("vectorizer.pickle", "rb") as f:
    vectorizer = pickle.load(f)
with open("selector.pickle", "rb") as f:
    selector = pickle.load(f)

Pickle serializes objects to disk and loads them back into memory when needed.
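
To see that the fitted state survives serialization, here is a small sketch that round-trips a vectorizer through bytes in memory and compares its learned attributes. The two-document corpus is an assumption chosen for brevity. Writing the same bytes to a file behaves the same way.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(["hello world", "this is a test"])

# Serialize to bytes and back; no file is needed for this check.
payload = pickle.dumps(vectorizer)
restored = pickle.loads(payload)

# Compare the learned vocabulary and IDF weights before and after.
same_vocab = restored.vocabulary_ == vectorizer.vocabulary_
same_idf = restored.idf_.tolist() == vectorizer.idf_.tolist()
print(same_vocab, same_idf)
```

Pickled estimators are generally expected to be loaded with the same scikit-learn version that created them, and pickle files should only be loaded from sources you trust.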

pickle lib docs
