<p>For spaCy 1.x, load the Google News vectors into gensim and convert them to a new format (each line of the .txt file contains a single vector: string, vec):</p>
<pre><code>from gensim.models import KeyedVectors

# Load the binary word2vec file, then write it back out in text format.
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('googlenews.txt')
</code></pre>
<p>Remove the first line of the .txt file:</p>
<pre><code>tail -n +2 googlenews.txt > googlenews.new && mv -f googlenews.new googlenews.txt
</code></pre>
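<p>The line removed above is the header that the word2vec text format puts at the top of the file: the vocabulary size and the vector dimensionality, separated by a space. Below is a minimal sketch for inspecting that header, meant to be run before the <code>tail</code> command. The helper name <code>read_word2vec_header</code> is hypothetical and not part of gensim or spaCy.</p>

```python
import io


def read_word2vec_header(path):
    """Return (vocab_size, vector_size) parsed from the first line of a word2vec text file.

    Hypothetical helper for illustration only; not a gensim or spaCy function.
    """
    with io.open(path, 'r', encoding='utf-8') as f:
        first_line = f.readline()
    parts = first_line.split()
    # A header has exactly two integer fields; a vector line has a word plus many floats.
    if len(parts) != 2:
        raise ValueError('no word2vec header found: %r' % first_line[:80])
    return int(parts[0]), int(parts[1])
```

<p>Calling <code>read_word2vec_header('googlenews.txt')</code> on a file whose header is already gone raises <code>ValueError</code>, which guards against deleting a real vector line by running <code>tail</code> twice.</p>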
<p>Compress the .txt file to .bz2:</p>
<pre><code>bzip2 googlenews.txt
</code></pre>
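<p>Where <code>tail</code> and <code>bzip2</code> are not available (on Windows, for example), the two shell steps above could be replaced by a few lines of Python. This is only a sketch: it assumes <code>googlenews.txt</code> still contains its header line, so it should be used instead of the shell commands, not after them. The function name <code>strip_header_and_compress</code> is hypothetical.</p>

```python
import bz2
import io


def strip_header_and_compress(txt_path, bz2_path):
    """Write txt_path to bz2_path as bzip2, skipping the first (header) line.

    Hypothetical helper for illustration only; not a gensim or spaCy function.
    """
    with io.open(txt_path, 'rb') as src, bz2.BZ2File(bz2_path, 'wb') as dst:
        src.readline()  # discard the "vocab_size vector_size" header
        for line in src:
            dst.write(line)
```

<p>For example, <code>strip_header_and_compress('googlenews.txt', 'googlenews.txt.bz2')</code>. Unlike the <code>bzip2</code> command, which deletes its input by default, this leaves the original .txt file in place.</p>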
<p>Create a spaCy-compatible binary file:</p>
<pre><code>import spacy.vocab
spacy.vocab.write_binary_vectors('googlenews.txt.bz2', 'googlenews.bin')
</code></pre>
<p>Move googlenews.bin to /lib/python/site-packages/spacy/data/en_google-1.0.0/vocab/googlenews.bin in your Python environment.</p>
<p>Then load the word vectors:</p>
<pre><code>import spacy
nlp = spacy.load('en',vectors='en_google')
</code></pre>
<p>Or load them later:</p>
<pre><code>nlp.vocab.load_vectors_from_bin_loc('googlenews.bin')
</code></pre>