Optimizing an algorithm in Python for building a list of items rated together

Pseudocode:

for each event (customer, item), sorted by item:
    add the user to the users dict if not present, and add the item to that user's list
    add the item to the items dict if not present, and add the user to that item's list

----------

for item, user in rows:
    # add the user to the users dict if they don't already exist.
    users[user] = users.get(user, [])
    # append the current item_id to the list of items rated by the current user
    users[user].append(item)
    if item != last_item:
        # we just started a new item which means we just finished processing an item
        # write the userlist for the last item to the usersForItem dictionary.
        if last_item != None:
            usersForItem[last_item] = userlist
        userlist = [user]
        last_item = item
        items.append(item)
    else:
        userlist.append(user)
usersForItem[last_item] = userlist

relatedItems = {}
for key, listOfUsers in usersForItem.items():
    relatedItems[key] = {}
    related = []
    for ux in listOfUsers:
        for itemRead in users[ux]:
            if itemRead != key:
                if itemRead not in related:
                    related.append(itemRead)
                relatedItems[key][itemRead] = relatedItems[key].get(itemRead, 0) + 1
    # calc jaccard/tanimoto similarity between relatedItems[key] and its values
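For the final similarity step, one possible sketch follows. It assumes the Jaccard/Tanimoto coefficient is taken over the sets of users who rated each item, that is |A ∩ B| / |A ∪ B|, and that each user appears at most once per item. The helper name `jaccard_from_counts` and the sample data are illustrative and not part of the original code; only `usersForItem` and `relatedItems` come from the question.

```python
def jaccard_from_counts(co_count, n_users_a, n_users_b):
    """Jaccard/Tanimoto similarity of two items from user counts.

    co_count  -- number of users who rated both items
    n_users_a -- number of users who rated item A
    n_users_b -- number of users who rated item B
    """
    union = n_users_a + n_users_b - co_count
    return co_count / float(union) if union else 0.0


# Hypothetical sample data in the shape built by the code above.
usersForItem = {"a": ["u1", "u2", "u3"], "b": ["u2", "u3"], "c": ["u3", "u4"]}
relatedItems = {"a": {"b": 2, "c": 1}, "b": {"a": 2, "c": 1}, "c": {"a": 1, "b": 1}}

# similarity[item][other] holds the coefficient for every co-rated pair.
similarity = {}
for key, counts in relatedItems.items():
    similarity[key] = {}
    for other, co_count in counts.items():
        similarity[key][other] = jaccard_from_counts(
            co_count, len(usersForItem[key]), len(usersForItem[other]))
```

Because the co-occurrence counts are already available, the intersection size does not need to be recomputed; only the two list lengths are looked up.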

3 Answers

User

Answer 1 · edited 2024-09-30 16:31:01

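Do you really need to precompute all possible pairs? What if you did it lazily, i.e. on demand?

The purchase data can be represented as a 2D matrix. The rows correspond to customers and the columns correspond to products.

Each entry is either 0 or 1, indicating whether the product for that column was bought by the customer for that row.

If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!

So you can compute those vectors first, and then compute the dot products lazily, on demand.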
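As a small illustration of the idea, here is a sketch with a tiny hand-written matrix. The sample data and the helper names `column` and `co_purchases` are made up for this example.

```python
# Rows are customers, columns are products (hypothetical sample data).
products = ["hammer", "nails", "screws"]
purchases = [
    [1, 1, 0],  # customer 0 bought hammer and nails
    [1, 1, 0],  # customer 1 bought hammer and nails
    [0, 1, 1],  # customer 2 bought nails and screws
]


def column(matrix, j):
    """Return column j of the matrix as a list of 0/1 values."""
    return [row[j] for row in matrix]


def co_purchases(matrix, j, k):
    """Dot product of columns j and k: customers who bought both products."""
    return sum(a * b for a, b in zip(column(matrix, j), column(matrix, k)))


hammer_nails = co_purchases(purchases, 0, 1)
nails_screws = co_purchases(purchases, 1, 2)
hammer_screws = co_purchases(purchases, 0, 2)
```

Each product of two entries is 1 only when the same customer bought both products, so the sum counts exactly those customers.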

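To compute the dot product:

A good representation of a vector that contains only 0s and 1s is an array of integers, which is basically a bitmap.

For 5000 entries, you need an array of 79 64-bit integers.

So, given two such arrays, you need to count the number of 1s they have in common.

To count the bits common to two integers, first take the bitwise AND, then count the number of 1 bits set in the result.

For that you can use lookup tables or one of the bit-counting methods (not sure whether Python supports them) such as those described here: http://graphics.stanford.edu/~seander/bithacks.html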
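In Python the AND-then-count step can be sketched without a lookup table. Python integers have arbitrary precision, so a single `int` can stand in for the whole array of 64-bit words; that substitution is an assumption of this sketch, not something the answer prescribes. The function name `common_bits` is illustrative. On Python 3.10 and later, `int.bit_count()` can replace the `bin(...).count("1")` idiom.

```python
def common_bits(x, y):
    """Count the bit positions that are set in both x and y."""
    return bin(x & y).count("1")


a = 0b101101
b = 0b100111
shared = common_bits(a, b)  # a & b == 0b100101
```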

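So your algorithm will look something like this:

Initialize an array of 79 64-bit integers for each product.
For each customer, look at the products bought and set that customer's bit in each of the corresponding products.
Now, given a query of two products for which you need the number of customers who bought them together, just take the dot product as described above.

This should be reasonably fast.

As a further optimization, you could consider grouping the customers.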
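The three steps can be sketched as follows. This version assumes a single Python `int` per product as the bitmap instead of an explicit array of 79 words, and it assigns bit positions to customers in order of first appearance. The names `build_bitmaps` and `bought_together` and the sample events are hypothetical.

```python
from collections import defaultdict


def build_bitmaps(events):
    """Map each product to an int whose bit i is set if customer i bought it.

    events is an iterable of (customer, product) pairs.
    """
    customer_index = {}         # customer id -> bit position
    bitmaps = defaultdict(int)  # product -> bitmap of customers
    for customer, product in events:
        bit = customer_index.setdefault(customer, len(customer_index))
        bitmaps[product] |= 1 << bit
    return bitmaps


def bought_together(bitmaps, p, q):
    """Number of customers who bought both p and q (AND, then count 1 bits)."""
    return bin(bitmaps.get(p, 0) & bitmaps.get(q, 0)).count("1")


# Hypothetical sample events: (customer id, product).
events = [
    (1, "hammer"), (1, "screwdriver"), (1, "nails"),
    (2, "hammer"), (2, "nails"),
    (3, "screws"), (3, "screwdriver"),
    (4, "nails"), (4, "screws"),
]
bitmaps = build_bitmaps(events)
hammer_nails = bought_together(bitmaps, "hammer", "nails")
nails_screws = bought_together(bitmaps, "nails", "screws")
hammer_screws = bought_together(bitmaps, "hammer", "screws")
```

Building the bitmaps is a single pass over the events, and each query costs one AND plus one bit count, so nothing is computed for pairs that are never asked about.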

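User

Answer 2 · edited 2024-09-30 16:31:01

Paul's answer is probably the best, but here is what I came up with over my lunch break (admittedly untested, but still a fun thought exercise). I'm not sure how fast or optimized my algorithm is. Personally, I'd suggest looking at a NoSQL database such as MongoDB, since it seems to lend itself nicely to this sort of problem (map/reduce and so on).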

# assuming events is an iterable of (cust_id, item) pairs...
user = {}
for (cust_id, item) in events:
    if cust_id not in user:
        user[cust_id] = set()
    user[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in user:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in user[key]:
        # make sure the item with dict is set up
        if not item in items:
            items[item] = {}
        # as we see each item, add to it others and others to it
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless i've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew* 
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp.append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together

User

Answer 3 · edited 2024-09-30 16:31:01

events = """\
1-hammer 
1-screwdriver 
1-nails 
2-hammer 
2-nails 
3-screws 
3-screwdriver 
4-nails 
4-screws""".splitlines()
events = sorted([part.strip() for part in e.split('-')] for e in events)

from collections import defaultdict
from itertools import groupby

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results    
for k,v in sorted(pairmap.items()):
    print('%s : %s' % (k, v))

Gives:

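hammer : {'nails': 2, 'screwdriver': 1}
nails : {'hammer': 2, 'screwdriver': 1, 'screws': 1}
screwdriver : {'hammer': 1, 'nails': 1, 'screws': 1}
screws : {'screwdriver': 1, 'nails': 1}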

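(This handles your initial request of grouping items by purchase event. To group by user instead, just change the first key of the events list from the event number to the user id.)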
