My code ran out of memory because of the problem I described in this page. So I wrote a second piece of code that uses an iterable alldocs instead of an all-in-memory alldocs.

This code reads every file path under a given folder. The content of each file alternates between two kinds of lines: a document name, followed by lines with that document's tokens. For instance:
clueweb09-en0010-07-00000
dove gif clipart pigeon clip art picture image hiox free birds india web icons clipart add stumble upon
clueweb09-en0010-07-00001
google bookmarks yahoo bookmarks php script java script jsp script licensed scripts html tutorials css tutorials
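As a side note, this file format can be parsed with a small standard-library helper (a sketch; the function name `parse_docs` is my own, not from the question):

```python
def parse_docs(lines):
    """Yield (doc_id, tokens) pairs from lines in the format above:
    an ID line starting with 'clueweb09-en00', followed by token lines."""
    doc_id, tokens = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('clueweb09-en00'):
            if doc_id is not None:      # flush the previous document
                yield doc_id, tokens
            doc_id, tokens = line, []
        else:
            tokens.extend(line.split())
    if doc_id is not None:              # flush the last document
        yield doc_id, tokens
```

Because it is a generator, it never holds more than one document's tokens in memory at a time.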
First code:
# coding: utf-8
import os
import MySQLRepository
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

def readAllFiles(path):
    for file in os.listdir(path):
        if os.path.isfile(path + "/" + file):
            prepareDoc2VecSetting(path + '/' + file)
        else:
            readAllFiles(path + "/" + file)

def prepareDoc2VecSetting(fname):
    mapDocName_Id = []
    keyValues = set()
    with open(fname) as alldata:
        a = alldata.readlines()
    label = 0
    tokens = []
    for i in range(len(a)):
        if a[i].startswith('clueweb09-en00'):
            mapDocName_Id.insert(label, a[i])
            label = label + 1
            alldocs.append(LabeledSentence(tokens[:], [label]))
            keyValues |= set(tokens)
            tokens = []
        else:
            tokens = tokens + a[i].split()
    mydb.insertkeyValueData(keyValues)
    mydb.insertDocId(mapDocName_Id)

mydb = MySQLRepository.MySQLRepository()
alldocs = []
pth = '/home/flr/Desktop/newInput/tokens'
readAllFiles(pth)

model = Doc2Vec(alldocs, size=300, window=5, min_count=2, workers=4)
model.save(pth + '/my_model.doc2vec')
Second code (I have left out the DB-related parts):

^{pr2}$

This is the error:
Traceback (most recent call last):
  File "/home/flashkar/git/doc2vec_annoy/Doc2Vec_Annoy/KNN/testiterator.py", line 44, in <module>
    model = Doc2Vec(allDocs, size=300, window=5, min_count=2, workers=4)
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 618, in __init__
    self.build_vocab(documents, trim_rule=trim_rule)
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 523, in build_vocab
    self.scan_vocab(sentences, progress_per=progress_per, trim_rule=trim_rule)  # initial survey
  File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 655, in scan_vocab
    for document_no, document in enumerate(documents):
  File "/home/flashkar/git/doc2vec_annoy/Doc2Vec_Annoy/KNN/testiterator.py", line 40, in __iter__
    yield LabeledSentence(tokens[:], tpl[1])
IndexError: list index out of range
You are using a generator function because you don't want to store all the documents, yet you still store them all in alldocs. You can simply yield LabeledSentence(tokens[:], tpl[1]) instead. What is happening right now is that you are appending a list to a list and returning that list, which is why you get the AttributeError. Additionally, since you append on every iteration, at iteration i you are returning document i together with all the documents that came before it!