I have the following data:
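What I want to do is compute the frequency of combinations of elements. For example:
Count the frequency of every single item and every combination of items, and keep only the single items and combinations whose frequency is greater than or equal to n, where n is any positive integer. For this example, assume n takes values in {1, 2, 3, 4}.
I have been trying the following code: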
# candidate itemsets
records = []
# build a list of lists of products that were bought together (convert df to list of lists)
num_records = len(data)  # data is the DataFrame holding the transactions
for i in range(0, num_records):
    records.append([str(data.values[i, j]) for j in range(0, len(data.columns))])
# clean the list (drop NaN values)
records = [[x for x in y if str(x) != 'nan'] for y in records]
OUTPUT:
[['detergent'],
['bread', 'water'],
['bread', 'umbrella', 'milk', 'diaper', 'beer'],
['detergent', 'beer', 'umbrella', 'milk'],
['cheese', 'detergent', 'diaper', 'umbrella'],
['umbrella', 'water', 'beer'],
['umbrella', 'water'],
['water', 'umbrella'],
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
['umbrella', 'cheese', 'detergent', 'water', 'beer']]
Then:
setOfItems = []
newListOfItems = []
for item in records:
    if item in setOfItems:
        continue
    setOfItems.append(item)
    temp = list(item)
    occurence = records.count(item)
    temp.append(occurence)
    newListOfItems.append(temp)
OUTPUT:
['detergent', 1]
['bread', 'water', 1]
['bread', 'umbrella', 'milk', 'diaper', 'beer', 1]
['detergent', 'beer', 'umbrella', 'milk', 1]
['cheese', 'detergent', 'diaper', 'umbrella', 1]
['umbrella', 'water', 'beer', 1]
['umbrella', 'water', 1]
['water', 'umbrella', 1]
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella', 1]
['umbrella', 'cheese', 'detergent', 'water', 'beer', 1]
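As you can see, it only counts the frequency of whole rows (from the first image), but my expected output is the one shown in the second image.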
Interesting problem! I use itertools.combinations() to generate all possible combinations and collections.Counter() to count how often each combination occurs.
Documentation for collections.Counter(): https://docs.python.org/2/library/collections.html#collections.Counter
Documentation for itertools.combinations(): https://docs.python.org/2/library/itertools.html#itertools.combinations
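A minimal sketch of how itertools.combinations() and collections.Counter() could be combined for this task, using the records list from the question. The function name count_itemsets and the parameter min_count (standing in for n) are hypothetical names chosen for illustration. Sorting each transaction before generating combinations is an assumption made here so that ['umbrella', 'water'] and ['water', 'umbrella'] count as the same itemset.

```python
from collections import Counter
from itertools import combinations

# Transactions copied from the question's cleaned `records` list.
records = [
    ['detergent'],
    ['bread', 'water'],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['detergent', 'beer', 'umbrella', 'milk'],
    ['cheese', 'detergent', 'diaper', 'umbrella'],
    ['umbrella', 'water', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
    ['umbrella', 'cheese', 'detergent', 'water', 'beer'],
]


def count_itemsets(transactions, min_count=1):
    """Count every non-empty item combination; keep those seen >= min_count times.

    count_itemsets and min_count are hypothetical names, not from the original post.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so that item order within a row does not
        # produce two different keys for the same itemset.
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            # Each combination is a tuple, so it can be used as a Counter key.
            counts.update(combinations(items, size))
    return {itemset: c for itemset, c in counts.items() if c >= min_count}


frequent = count_itemsets(records, min_count=3)
# Print shorter itemsets first, then alphabetically.
for itemset, c in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, c)
```

Note that a transaction with k distinct items produces 2^k - 1 combinations, so this brute-force enumeration is only practical when baskets are small.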