kmeans使用python实现集群，内存不足问题的回答

kmeans使用python实现集群，内存不足

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

<h3>注意：此问题底部的更新/解决方案</h3> 作为产品推荐引擎的一部分，我试图根据用户的产品偏好对他们进行细分，首先使用k-means聚类算法。在 我的数据是以下形式的字典： <pre><code>prefs = { 'user_id_1': { 1L: 3.0f, 2L: 1.0f, }, 'user_id_2': { 4L: 1.0f, 8L: 1.5f, }, } </code></pre> 其中产品id是long，评级是float。数据稀疏。我目前有大约60000个用户，其中大多数只对少数产品进行了评级。每个用户的值字典是使用defaultdict（float）实现的，以简化代码。在 我对k-means聚类的实现如下： ^{pr2}$ 据我所知，该算法处理第一部分（将每个用户分配到其最近的质心）很好。在 问题是在进行下一部分时，取每个集群中每个产品的平均评级，并将这些平均评级用作下一次通过的质心。在 基本上，在计算第一个集群（i=0）之前，算法就会在这一行出现内存错误： <pre><code>if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best) </code></pre> 起初除法步骤是在一个单独的循环，但没有更好的表现。在 这是我得到的例外： <pre><code>malloc: *** mmap(size=16777216) failed (error code=12) *** error: can't allocate region *** set a breakpoint in malloc_error_break to debug </code></pre> 任何帮助都将不胜感激。在 <hr/> <h3>更新：最终算法</h3> 多亏了这里的帮助，这是我的固定算法。如果你发现任何明显的错误，请添加评论。在 首先，简单的安培生实现 <pre><code>def simple_pearson(v1,v2): si = [val for val in v1 if val in v2] n = len(si) if n==0: return 0.0 sum1 = 0.0 sum2 = 0.0 sum1_sq = 0.0 sum2_sq = 0.0 p_sum = 0.0 for v in si: sum1+=v1[v] sum2+=v2[v] sum1_sq+=pow(v1[v],2) sum2_sq+=pow(v2[v],2) p_sum+=(v1[v]*v2[v]) # Calculate Pearson score num = p_sum-(sum1*sum2/n) temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n) if temp < 0.0: temp = -temp den = sqrt(temp) if den==0: return 1.0 r = num/den return r </code></pre> 一种将简单的pearson转换为距离值的简单方法： <pre><code>def distance(v1,v2): return 1.0-simple_pearson(v1,v2) </code></pre> 最后，k-means集群实现： <pre><code>def kcluster(prefs,k=21,max_iterations=50): from collections import defaultdict users = prefs.keys() centroids = [prefs[u] for u in random.sample(users, k)] lastmatches = None for t in range(max_iterations): print 'Iteration %d' % t bestmatches = [[] for i in range(k)] # Find which centroid is closest for each row for j in users: row = prefs[j] bestmatch=(0,2.0) for i in range(k): d = distance(row,centroids[i]) if d <= bestmatch[1]: bestmatch = (i,d) bestmatches[bestmatch[0]].<a href="https://www.cnpython.com/list/append" class="inner-link">append</a>(j) # If the results are the same as last time, this is complete if bestmatches == lastmatches: break lastmatches=bestmatches centroids = [defaultdict(float) for i in range(k)] # Move the centroids to the average of their members for i in range(k): len_best = len(bestmatches[i]) if len_best > 0: for user_id in bestmatches[i]: row = prefs[user_id] for m in row: centroids[i][m]+=row[m] for key in centroids[i].keys(): centroids[i][key]/=len_best # We may have made the centroids quite dense which significantly # slows down subsequent iterations, so we delete values below a # threshold to speed things up if centroids[i][key] < 0.001: del centroids[i][key] return centroids, bestmatches </code></pre>

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

并非所有这些观察结果都与您所表达的问题直接相关，但是….： a.如图所示，为什么在prefs中的键是long？除非你有几十亿的用户，简单的INT就可以了，而且可以节省一点内存。在 b.您的代码： <pre><code>centroids = [prefs[random.choice(users)] for i in range(k)] </code></pre> 可以给你重复（两个相同的质心），这反过来又不会使K-means算法满意。用更快更结实的 ^{pr2}$ c.在你发布的代码中，你调用了一个函数<code>simple_pearson</code>，你从来没有在任何地方定义过这个函数；我假设你的意思是调用<code>sim_func</code>，但同时又不得不猜测你发布的代码与任何实际运行的代码有何不同，这确实很难帮助解决不同的问题 还有一个迹象表明，这个发布的代码可能不同于任何实际工作的代码：您设置了<code>bestmatch=(0,0)</code>，但是使用<code>if d < bestmatch[1]:</code>进行测试，测试怎么会成功？距离函数是否返回负值？在 e.defaultdict的意义在于，仅仅访问<code>row[m]</code>就神奇地在索引<code>m</code>的<code>row</code>中添加一个项（通过调用defaultdict的工厂获得的值，这里是0.0）。那个项目将永远占据记忆。您绝对不需要这种行为，因此您的代码： <pre><code> row = prefs[user_id] for m in items: if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best) </code></pre> 正在浪费大量内存，使<code>prefs</code>从原来的稀疏矩阵变成一个密集矩阵（大部分都是0.0值）。如果你改为编码 <pre><code> row = prefs[user_id] for m in row: centroids[i][m]+=(row[m]/len_best) </code></pre> 在<code>row</code>中没有增长，因此在{<cd8>}中没有增长，因为您正在循环<code>row</code>已经拥有的键。在 可能还有许多其他类似的问题，主要类似于最后一个问题，或者以次要问题为例 f.不要用<code>len_best</code>：在循环外计算它的逆1，然后乘以这个逆数。你不需要在循环内做乘法运算，你可以在一个单独的末尾做，因为它是相同的值乘以每一项，这既不节省内存，又避免了肆意浪费CPU时间；-）。好吧，我想这是两个小问题，而不仅仅是一个；-）。在 正如我所提到的，可能还有很多其他问题，但是由于这六个（或七个）已经显示出问题的密度，再加上S.Lott已经提出的单独建议（我认为这不会解决您的主要内存不足问题，因为他的代码仍然通过太多不包含的键来处理<code>row</code>defaultdict），我认为继续寻找更多的可能不是很有成效的，也许从解决这些问题开始，如果问题仍然存在，就单独提出一个关于这些问题的问题。。。？在

kmeans使用python实现集群，内存不足

1 个回答

相关Python问题