kmeans使用python实现集群，内存不足

def simple_pearson(v1,v2): si = [val for val in v1 if val in v2] n = len(si) if n==0: return 0.0 sum1 = 0.0 sum2 = 0.0 sum1_sq = 0.0 sum2_sq = 0.0 p_sum = 0.0 for v in si: sum1+=v1[v] sum2+=v2[v] sum1_sq+=pow(v1[v],2) sum2_sq+=pow(v2[v],2) p_sum+=(v1[v]*v2[v]) # Calculate Pearson score num = p_sum-(sum1*sum2/n) temp = (sum1_sq-pow(sum1,2)/n) * (sum2_sq-pow(sum2,2)/n) if temp < 0.0: temp = -temp den = sqrt(temp) if den==0: return 1.0 r = num/den return r

def kcluster(prefs,k=21,max_iterations=50): from collections import defaultdict users = prefs.keys() centroids = [prefs[u] for u in random.sample(users, k)] lastmatches = None for t in range(max_iterations): print 'Iteration %d' % t bestmatches = [[] for i in range(k)] # Find which centroid is closest for each row for j in users: row = prefs[j] bestmatch=(0,2.0) for i in range(k): d = distance(row,centroids[i]) if d <= bestmatch[1]: bestmatch = (i,d) bestmatches[bestmatch[0]].append(j) # If the results are the same as last time, this is complete if bestmatches == lastmatches: break lastmatches=bestmatches centroids = [defaultdict(float) for i in range(k)] # Move the centroids to the average of their members for i in range(k): len_best = len(bestmatches[i]) if len_best > 0: for user_id in bestmatches[i]: row = prefs[user_id] for m in row: centroids[i][m]+=row[m] for key in centroids[i].keys(): centroids[i][key]/=len_best # We may have made the centroids quite dense which significantly # slows down subsequent iterations, so we delete values below a # threshold to speed things up if centroids[i][key] < 0.001: del centroids[i][key] return centroids, bestmatches

2条回答

网友

1楼 · 编辑于 2024-09-27 17:58:31

您的centroids不需要是实际的列表。在

除了centroids[i][m]，您似乎从未引用过其他任何内容。如果您只想要centroids[i]，那么它可能不需要是一个列表；一个简单的字典就可以了。在

    centroids = defaultdict(float)

    #  Move the centroids to the average of their members
    for i in range(k):
        len_best = len(bestmatches[i])

        if len_best > 0:             
            items = set.union(*[set(prefs[u].keys()) for u in bestmatches[i]])

            for user_id in bestmatches[i]:
                row = prefs[user_id]
                for m in items:
                    if row[m] > 0.0: centroids[m]+=(row[m]/len_best)

可能会更好。在

网友

2楼 · 编辑于 2024-09-27 17:58:31

并非所有这些观察结果都与您所表达的问题直接相关，但是….：

a.如图所示，为什么在prefs中的键是long？除非你有几十亿的用户，简单的INT就可以了，而且可以节省一点内存。在

b.您的代码：

centroids = [prefs[random.choice(users)] for i in range(k)]

可以给你重复（两个相同的质心），这反过来又不会使K-means算法满意。用更快更结实的

^{pr2}$

c.在你发布的代码中，你调用了一个函数simple_pearson，你从来没有在任何地方定义过这个函数；我假设你的意思是调用sim_func，但同时又不得不猜测你发布的代码与任何实际运行的代码有何不同，这确实很难帮助解决不同的问题

还有一个迹象表明，这个发布的代码可能不同于任何实际工作的代码：您设置了bestmatch=(0,0)，但是使用if d < bestmatch[1]:进行测试，测试怎么会成功？距离函数是否返回负值？在

e.defaultdict的意义在于，仅仅访问row[m]就神奇地在索引m的row中添加一个项（通过调用defaultdict的工厂获得的值，这里是0.0）。那个项目将永远占据记忆。您绝对不需要这种行为，因此您的代码：

  row = prefs[user_id]                    
  for m in items:
      if row[m] > 0.0: centroids[i][m]+=(row[m]/len_best)

正在浪费大量内存，使prefs从原来的稀疏矩阵变成一个密集矩阵（大部分都是0.0值）。如果你改为编码

  row = prefs[user_id]                    
  for m in row:
      centroids[i][m]+=(row[m]/len_best)

在row中没有增长，因此在{}中没有增长，因为您正在循环row已经拥有的键。在

可能还有许多其他类似的问题，主要类似于最后一个问题，或者以次要问题为例

f.不要用len_best：在循环外计算它的逆1，然后乘以这个逆数。你不需要在循环内做乘法运算，你可以在一个单独的末尾做，因为它是相同的值乘以每一项，这既不节省内存，又避免了肆意浪费CPU时间；-）。好吧，我想这是两个小问题，而不仅仅是一个；-）。在

正如我所提到的，可能还有很多其他问题，但是由于这六个（或七个）已经显示出问题的密度，再加上S.Lott已经提出的单独建议（我认为这不会解决您的主要内存不足问题，因为他的代码仍然通过太多不包含的键来处理rowdefaultdict），我认为继续寻找更多的可能不是很有成效的，也许从解决这些问题开始，如果问题仍然存在，就单独提出一个关于这些问题的问题。。。？在

注意：此问题底部的更新/解决方案

更新：最终算法

相关问题更多 >

编程相关推荐

热门问题

热门文章