Out of memory when saving a list with numpy

    print "Collecting Raw Documents, tokenize, and remove stop words"
    df = pd.read_pickle(path + datasetName + "Train")
    frequency = defaultdict(int)
    gen_docs = []
    totalArts = len(df)
    for artNum in range(totalArts):
        if artNum % 2500 == 0:
            print "Gen Docs Creation on " + str(artNum) + " of " + str(totalArts)
        bodyText = df.loc[artNum,"fullContent"]
        bodyText = re.sub('<[^<]+?>', '', str(bodyText))
        bodyText = re.sub(pun, " ", str(bodyText))
        tmpDoc = []
        for w in word_tokenize(bodyText):
            w = w.lower().decode("utf-8", errors="ignore")
            #if w not in STOPWORDS and len(w) > 1:
            if len(w) > 1:
                #w = wordnet_lemmatizer.lemmatize(w)
                w = re.sub(num, "number", w)
                tmpDoc.append(w)
                frequency[w] += 1
        gen_docs.append(tmpDoc)
    print len(gen_docs)
    del df
    print "Saving unfiltered gen"
    dataSetName = path + dataSetName
    np.save("%s_lemmaWords_noStop_subbedNums.npy" % dataSetName, gen_docs)

1 answer

User

#1 · Posted on 2024-10-01 07:10:50

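np.save first tries to convert its input into an array. After all, it is designed to save numpy arrays.

If the resulting array is multidimensional and holds numeric or string values (according to its dtype), it saves some basic dimension information plus a memory copy of the array's data buffer.

But if the array contains other objects (that is, its dtype is object), it pickles those objects and saves the resulting string.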
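The two cases can be compared with a minimal sketch like the one below. It assumes a current NumPy on Python 3 (the question's code is Python 2), and the helper name `save_to_bytes` and the sample data are made up for illustration.

```python
import io

import numpy as np


def save_to_bytes(arr):
    """Serialize arr with np.save into memory and return the raw bytes."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


# Case 1: a plain numeric array. The file is a short header followed by
# a copy of the array's data buffer.
numeric = np.arange(6, dtype=np.int64).reshape(2, 3)
numeric_bytes = save_to_bytes(numeric)

# Case 2: an object array holding Python lists of different lengths.
# np.save has to pickle the elements instead of copying a buffer.
ragged = np.empty(2, dtype=object)
ragged[0] = ["a", "bb"]
ragged[1] = ["ccc"]
object_bytes = save_to_bytes(ragged)

numeric_back = np.load(io.BytesIO(numeric_bytes))
# Loading pickled objects must be enabled explicitly.
object_back = np.load(io.BytesIO(object_bytes), allow_pickle=True)
```

Note that loading the object array without `allow_pickle=True` raises an error, which is one quick way to tell which of the two paths a saved file took.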

I would try:

    arr = np.array(gen_docs)

Does that produce a memory error?

If not, what are its shape and dtype?

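If the tmpDoc sublists differ in length, arr will be a 1d array of objects, those objects being the tmpDoc lists.

Only if all the tmpDoc lists have the same length will it produce a 2d array. Even then, the dtype depends on the elements, whether they are numbers, strings, or other objects.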
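A small sketch of the shape and dtype check described above, using made-up token lists in place of gen_docs. It assumes a current NumPy on Python 3; recent NumPy releases refuse to build an array from sublists of different lengths unless dtype=object is passed, so the sketch requests it explicitly.

```python
import numpy as np

# Hypothetical stand-ins for gen_docs.
ragged_docs = [["the", "cat"], ["sat"]]          # sublists differ in length
equal_docs = [["the", "cat"], ["sat", "down"]]   # sublists share a length

# Ragged input: a 1d object array whose elements are the original lists.
ragged_arr = np.array(ragged_docs, dtype=object)

# Equal-length input: a 2d array with a string dtype.
equal_arr = np.array(equal_docs)

print(ragged_arr.shape, ragged_arr.dtype)
print(equal_arr.shape, equal_arr.dtype)
```

For a real gen_docs with documents of varying length, the first case is the one that applies, so np.save would end up pickling every token list.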

I might add that arrays themselves are handled using the save protocol.
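Since np.save ends up pickling ragged token lists anyway, one possible way to sidestep the array conversion is to write the documents out one at a time. This is not part of the answer above, only a sketch under assumptions: Python 3, tokens that are plain strings, and hypothetical names `write_docs`, `read_docs`, and `gen_docs.jsonl`.

```python
import json
import os
import tempfile


def write_docs(path, docs):
    """Write one JSON-encoded token list per line, without building an array."""
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            fh.write(json.dumps(doc) + "\n")


def read_docs(path):
    """Yield the token lists back one at a time."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


# Made-up sample data standing in for gen_docs.
sample_docs = [["the", "cat"], ["sat"]]
out_path = os.path.join(tempfile.mkdtemp(), "gen_docs.jsonl")
write_docs(out_path, sample_docs)
restored = list(read_docs(out_path))
```

Because each document is written as soon as it is produced, the same pattern could be applied inside the tokenizing loop so that the full list never has to be held or copied in memory at once.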
