<p>As <code>alKid</code> pointed out, make it <code>utf-8</code>.</p>
<p>There are two other things you may need to worry about:</p>
<ol>
<li>The input is too large to fit in memory, so it has to be streamed from a file.</li>
<li>Removing stopwords from the sentences.</li>
</ol>
<p>Instead of loading a big list into memory, you can do the following:</p>
<pre><code>import nltk, gensim

class FileToSent(object):
    """Stream a corpus file, yielding one tokenized, stopword-filtered sentence per line."""
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        # Read lazily, one line at a time, so the whole corpus
        # never has to fit in memory.
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                yield [w for w in line.lower().split() if w not in self.stop]
</code></pre>
<p>Then:</p>
<pre><code>sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
</code></pre>
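<p>The per-line processing the iterator performs can be sketched without nltk or gensim: below, a tiny hard-coded stopword set stands in for <code>nltk.corpus.stopwords.words('english')</code>, but the lowercase/split/filter steps are the same ones <code>FileToSent.__iter__</code> applies to each line.</p>

```python
# A small hard-coded stopword set standing in for NLTK's English list.
stop = {'the', 'a', 'is', 'in', 'of'}

def tokenize(line, stop):
    # Lowercase, split on whitespace, and drop stopwords --
    # the same per-line steps FileToSent.__iter__ performs.
    return [w for w in line.lower().split() if w not in stop]

print(tokenize('The cat sat in the hat', stop))
# ['cat', 'sat', 'hat']
```

<p>Word2Vec only needs <code>sentences</code> to be an iterable of token lists like this, which is why a lazy iterator class works as well as an in-memory list.</p>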