How do I load sentences into Python gensim?

Published 2024-05-17 05:04:06


I am trying to use the word2vec module from gensim, the natural language processing library for Python.

The documentation says to initialize the model like this:

from gensim.models import Word2Vec
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format do the input sentences need to be in? I have raw text:

"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.

What additional preprocessing do I need to do for word2vec?


Update: here is what I tried. After loading the sentences, I get nothing back.

>>> sentences = ['the quick brown fox jumps over the lazy dogs',
...              "Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split() for s in sentences])
>>> x.vocab
{}

2 Answers

A list of tokenized sentences (each sentence is a list of word strings). You can also stream the data from disk.

Make sure the text is utf-8 (on Python 3, str is already unicode), then split it:

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
# On Python 3, str is already unicode text, so there is no need to encode before splitting.
# Note: `size` was renamed to `vector_size` in gensim >= 4.0.
word2vec.Word2Vec([s.split() for s in sentences], size=100, window=5, min_count=5, workers=4)

As alKid pointed out, make it utf-8.

Two other things you may need to worry about:

  1. The input is too large to hold in memory, so you load it from a file.
  2. Removing stop words from the sentences.

Instead of loading a big list into memory, you can do something like this:

import nltk, gensim

class FileToSent(object):
    """Stream one tokenized, stop-word-filtered sentence per line of a file."""
    def __init__(self, filename):
        self.filename = filename
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        with open(self.filename, 'r', encoding='utf-8') as f:
            for line in f:
                # lower-case, split on whitespace, and drop stop words
                yield [w for w in line.lower().split() if w not in self.stop]

Then:

sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, window=5, min_count=5, workers=4, hs=1)
