Recursive Python matching algorithm based on subsets runs too slowly

import itertools

# global counter used to measure complexity while debugging
counter = 0


# takes in a sequence and returns a list of all subsets that drop n of its items
# (depth 0 is the full set, depth 1 is every subset missing one item, and so on)
def arbSubsets(seq, n):
    return list(itertools.combinations(seq, len(seq) - n))


# takes in a tagsList and checks Gapper.objects.all() to see if any gapper has all those tags
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    exactMatches = []
    for gapper in Gapper.objects.all():
        gapperSet = set(gapper.tags.names())
        if tagsSet.issubset(gapperSet):
            exactMatches.append(gapper)
    return exactMatches


# takes in a tagsList that has been cleaned to remove any tags that NO gappers have
# and then checks gapper objects to find the optimal match
def matchGapper(tagsList, depth, results):
    # without this declaration, `counter += 1` raises UnboundLocalError
    global counter

    # handles the case where we're only given tags contained by no gappers
    if depth == len(tagsList):
        return []

    # counter variable is to measure complexity for debugging
    counter += 1

    # we don't want too many results or it stops feeling tailored
    upper_limit_results = 3

    # now we must check subsets for a match
    subsets = arbSubsets(tagsList, depth)
    for subset in subsets:
        counter += 1
        matches = exactMatches(subset)
        if matches:
            for match in matches:
                counter += 1
                # need to check because we might be adding depth 2 to results from depth 1,
                # which we didn't do before, to make sure we have at least 3 results
                if match not in results:
                    # don't want to show too many or it doesn't feel tailored anymore
                    counter += 1
                    if len(results) > upper_limit_results:
                        break
                    results.append(match)

    # always give at least 3 results
    if len(results) > 2:
        return results
    else:
        # check one level deeper (less specific) into tags if not enough gappers
        # match, to get more results
        counter += 1
        return matchGapper(tagsList, depth + 1, results)


# this is the list of matches we then return to the user
matches = matchGapper(tagsList, 0, [])

2 Answers

User

#1 · edited 2024-09-27 19:26:36

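It does not look like you are doing only a few hundred computation steps. You have a few hundred options at each depth, so to estimate the complexity of your solution you should multiply the step counts at each depth rather than add them.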
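To make the multiply-rather-than-add point concrete, here is a rough count based on the code in the question: at depth d, arbSubsets yields C(n, d) subsets of the n tags, and exactMatches loops over every gapper for each of them. The function name `estimate_set_checks` and the sample sizes are illustrative assumptions, not part of the original post.

```python
from math import comb


def estimate_set_checks(num_tags, num_gappers):
    """Worst-case number of issubset() checks made by matchGapper.

    Assumes the recursion runs through every depth 0..num_tags-1 and that
    each subset triggers one pass over all gappers (as exactMatches does).
    """
    total = 0
    for depth in range(num_tags):
        subsets_at_depth = comb(num_tags, depth)  # combos of size num_tags - depth
        total += subsets_at_depth * num_gappers   # multiply, don't add
    return total


print(estimate_set_checks(10, 100))  # 102300
```

With only 10 tags and 100 gappers the worst case is already about 100,000 set comparisons, on top of one `Gapper.objects.all()` pass per subset.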

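Also, this statement: "This or adapting the stable marriage problem, I don't think any of those will work because it's a small dataset" is clearly not correct either. Those algorithms may be overkill for very simple cases, but they are still valid and will work on a small dataset too.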

User

#2 · edited 2024-09-27 19:26:36

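OK, after fiddling with timers for a while, I finally figured it out. Several functions are involved in matching: exactMatches, matchGapper and arbSubsets. When I put the counter in a global variable and measured the number of operations (counted as lines of my own code executed), it came to roughly 2-10K for large inputs (about 10 tags).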

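Admittedly, arbSubsets, which returns the list of subsets, looked like a plausible bottleneck at first. But if you look closely, 1) we are dealing with a small number of tags (on the order of 10-50) and, more importantly, 2) we only call arbSubsets when recursing in matchGapper, which happens at most about 10 times, because tagsList can only be about 10 items long (on the order of 10-50, as noted above). When I measured the time needed to generate the arbitrary subsets, it was on the order of 2e-5 seconds per call. So the total time spent generating subsets of all sizes is only about 2e-4 seconds. In other words, this is not the source of the 5-30 second wait in the web application.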
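A measurement along these lines can be reproduced with the standard `timeit` module. This is a minimal sketch with a made-up 10-tag input; the absolute numbers depend on the machine and will not necessarily match the figures quoted above.

```python
import itertools
import timeit


def arbSubsets(seq, n):
    # Same definition as in the question: all combinations of size len(seq) - n.
    return list(itertools.combinations(seq, len(seq) - n))


tags = [f"tag{i}" for i in range(10)]  # hypothetical 10-tag input

# Average seconds per call at each depth, measured over 1000 calls.
per_depth = [
    timeit.timeit(lambda d=depth: arbSubsets(tags, d), number=1000) / 1000
    for depth in range(len(tags))
]
total_seconds = sum(per_depth)
print(f"slowest depth: {max(per_depth):.2e} s, all depths: {total_seconds:.2e} s")
```

Even the largest depth (252 subsets for 10 tags) should take well under a millisecond per call, which supports ruling out subset generation as the cause of a multi-second delay.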

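So, with that out of the way: knowing that arbSubsets is called only about 10 times and is fast, and that at most about 10K operations happen in my own code, it became clear that I must be calling some built-in function, such as set() or .issubset() or something similar, that takes a significant amount of time and is executed many times. After adding counters in a few more places, it was obvious that exactMatches() accounts for roughly 95-99% of all the computation (which is to be expected if we have to check every combination of subsets of every size for exact matches).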
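One way to collect this kind of per-function breakdown is a small timing decorator. The sketch below is only an illustration: `timed`, `stats` and the in-memory `GAPPERS` list are hypothetical stand-ins, and the real exactMatches queries Django models instead.

```python
import time
from collections import defaultdict
from functools import wraps

# Per-function call counts and accumulated wall-clock time.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def timed(func):
    """Accumulate call count and elapsed time under the function's name."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


# Hypothetical stand-in for the Django queryset: (name, tags) pairs.
GAPPERS = [("g%d" % i, {"a", "b", "t%d" % (i % 7)}) for i in range(500)]


@timed
def exactMatches(tagsList):
    tagsSet = set(tagsList)
    return [name for name, tags in GAPPERS if tagsSet.issubset(tags)]


for _ in range(200):
    exactMatches(["a", "b"])

for name, entry in stats.items():
    print(name, entry["calls"], f"{entry['seconds']:.4f}s")
```

Times recorded this way are inclusive, so a decorated function that calls another decorated function counts the inner time as well. The standard library's `cProfile` module gives a similar breakdown without editing the code.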

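So at this point the problem comes down to the fact that exactMatches, as implemented, takes about 0.02 seconds per call (measured empirically) and is called thousands of times. We can therefore either try to make it several orders of magnitude faster (it is already fairly optimized) or take a different approach that does not rely on subsets to find matches. A friend of mine suggested building a dict whose keys are all tag combinations (so 2^len(tagsList) keys) and whose values are the lists of registered profiles that have exactly that combination. A query then becomes a lookup in a (huge) dict, which can be done very quickly. Any other suggestions are welcome.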
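A minimal sketch of the dict idea might look like the following. The profile data and the names `build_index` and `exact_matches` are made up for illustration; in the application the tag sets would come from the Gapper objects.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical profile data; in the app this would come from Gapper.objects.all().
PROFILES = {
    "alice": {"python", "django", "sql"},
    "bob": {"python", "sql"},
    "carol": {"django"},
}


def build_index(profiles):
    """Map every non-empty tag combination to the profiles that have all of it."""
    index = defaultdict(list)
    for name, tags in profiles.items():
        ordered = sorted(tags)
        for size in range(1, len(ordered) + 1):
            for combo in combinations(ordered, size):
                index[frozenset(combo)].append(name)
    return index


INDEX = build_index(PROFILES)


def exact_matches(tags_list):
    # One dict lookup replaces the loop over every profile.
    return INDEX.get(frozenset(tags_list), [])


print(exact_matches(["python", "sql"]))  # ['alice', 'bob']
print(exact_matches(["django", "sql"]))  # ['alice']
```

The index is built once, and each profile with k tags contributes 2^k - 1 keys, so this only stays practical when individual profiles carry a small number of tags.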
