Problem grouping similar files with multiprocessing and ssdeep in Python

def ssdpComparer(lst, threshold):
    s = ssdeep()
    check_file = []
    result_data = []
    lst1 = lst
    set_lst = set(lst)

    print '>>>START'
    for tup1 in lst1:
        if tup1 in check_file:
            continue
        for tup2 in set_lst:
            score = s.compare(tup1[0], tup2[0])
            if score >= threshold:
                result_data.append((score, tup1[2], tup2[2])) #Score, GroupID, FileID
                check_file.append(tup2)
        set_lst = set_lst.difference(check_file)
    print """####### DONE #######"""
    remain_lst = set(lst).difference(check_file)

    return (result_data, remain_lst)

def parallelProcessing(tochunk_list, total_processes, threshold, source_path, mode, REMAINING_LEN = 0):
    result = []
    remainining = []
    pooled_lst = []
    pair = []
    chunks_toprocess = []

    print 'Total Files:', len(tochunk_list)

    if mode == MODE_INTENSIVE:
        chunks_toprocess = groupWithBlockID(tochunk_list) #blockID chunks
    elif mode == MODE_THOROUGH:
        chunks_toprocess = groupSafeLimit(tochunk_list, TOTAL_PROCESSES) #Chunks by processes
    elif mode == MODE_FAST:
        chunks_toprocess = groupSafeLimit(tochunk_list) #5000 chunks

    print 'No. of files group to process: %d' % (len(chunks_toprocess))
    pool_obj = Pool(processes = total_processes, initializer = poolInitializer, initargs = [None, threshold, source_path, mode])
    pooled_lst = pool_obj.map(matchingProcess, chunks_toprocess) #chunks_toprocess
    tmp_rs, tmp_rm = getResultAndRemainingLists(pooled_lst)
    result += tmp_rs
    remainining += tmp_rm

    print 'RESULT LEN: %s, REMAINING LEN: %s, P.R.L: %s' % (len(result), len(remainining), REMAINING_LEN)
    tmp_r_len = len(remainining)

    if tmp_r_len != REMAINING_LEN and len(result) > 0 :
        result += parallelProcessing(remainining, total_processes, threshold, source_path, mode, tmp_r_len)
    else:
        result += [('','', rf[2]) for rf in remainining]

    return result

def getResultAndRemainingLists(pooled_lst):
    g_result = []
    g_remaining = []

    for tup_result in pooled_lst:
        tmp_result, tmp_remaining = tup_result
        g_result += tmp_result
        if tmp_remaining:
            g_remaining += tmp_remaining

    return (g_result, g_remaining)

1 answer

User

#1 · Posted on 2024-09-24 02:16:22

First suggestion: in your case there is no need to keep check_file as a list; change it to a set() and it should perform better (explanation at the end).

If you need chunks, maybe a procedure like this would be enough:

def split_to_chunks(wholeFileList, threshold):
    # ssdeep() is the same wrapper class used in the question's code
    s = ssdeep()
    calculated_chunks = []
    for someFileId in wholeFileList:
        for chunk in calculated_chunks:
            # compare against the first element, which acts as the chunk's base
            if s.compare(chunk[0], someFileId) > threshold:
                chunk.append(someFileId)
                break
        else:  # important: this else belongs to the inner 'for', not to the 'if'
            # no 'break' happened, so someFileId becomes the base of a new chunk
            calculated_chunks.append([someFileId])
    return calculated_chunks

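Afterwards, you can filter the result (here result is the list returned by split_to_chunks):

groups = filter(lambda x: len(x) > 1, result)
remaining = filter(lambda x: len(x) == 1, result)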
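The grouping and filtering steps can be tried end to end without ssdeep installed. What follows is only a sketch in Python 3 under stated assumptions: `similarity` is a hypothetical stand-in built on `difflib` (it is not ssdeep's scoring), the `compare` parameter and the helper `split_groups_and_remaining` are names introduced here for illustration, and the sample strings are made up.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Hypothetical stand-in for s.compare(): an int score from 0 to 100."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def split_to_chunks(items, threshold, compare=similarity):
    """Greedy grouping: each item joins the first chunk whose base it matches."""
    chunks = []
    for item in items:
        for chunk in chunks:
            # chunk[0] is the "base" of the chunk
            if compare(chunk[0], item) > threshold:
                chunk.append(item)
                break
        else:
            # No break happened, so item becomes the base of a new chunk.
            chunks.append([item])
    return chunks


def split_groups_and_remaining(chunks):
    """Separate chunks with several members from single-item chunks."""
    groups = [c for c in chunks if len(c) > 1]
    remaining = [c[0] for c in chunks if len(c) == 1]
    return groups, remaining


# Made-up sample data, only to exercise the functions.
sample = ["aaaaaaaaaa", "aaaaaaaaab", "zzzzzzzzzz", "aaaaaaaabb", "qqqqqqqqqq"]
groups, remaining = split_groups_and_remaining(split_to_chunks(sample, 70))
print(groups)
print(remaining)
```

List comprehensions are used instead of `filter` because in Python 3 `filter` returns a lazy iterator rather than a list.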

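Note: this algorithm assumes that the first element of a chunk is its "base". The quality of the result depends heavily on how ssdeep behaves (I can imagine a strange question: how transitive is ssdeep?). If the similarity is transitive, then it should be...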

The worst case is when no pair s.compare(fileId1, fileId2) satisfies the threshold condition (then the complexity is n^2, so in your case 1.3 mln * 1.3 mln).

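There is no easy way to optimize this case. Imagine a situation where s.compare(file1, file2) is always close to 0: then (as far as I understand), even if you know that s.compare(A, B) is very low and s.compare(B, C) is very low, you still cannot say anything about s.compare(A, C) => so you need to perform n*n operations.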
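To put a number on that worst case, here is a small counting sketch in Python 3. It assumes the greedy chunking approach and a comparison that never reaches the threshold; `count_worst_case` is a name made up for this illustration, and each counter increment stands in for one s.compare call. Under those assumptions every file is compared with every earlier base, which gives n*(n-1)/2 calls, about half of n*n but still quadratic.

```python
def count_worst_case(n):
    """Count compare calls when n items never match each other."""
    calls = 0
    bases = []  # one base per chunk; every item ends up in its own chunk
    for item in range(n):
        for base in bases:
            calls += 1  # stands in for one s.compare(base, item) call
            # the score never passes the threshold, so there is no break
        bases.append(item)
    return calls


# Check the closed form on a small input, then apply it to 1.3 million files.
small = count_worst_case(100)
n = 1300000
pairs = n * (n - 1) // 2
print(small)
print(pairs)
```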

Another note: you are using too many structures and lists, for example:

set_lst = set_lst.difference(check_file)

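This statement creates a new set() every time it runs, and every element of set_lst and check_file has to be touched at least once. Because check_file is a list, the "difference" function cannot be optimized, so its complexity is roughly len(check_file)*log(len(set_lst)) (with Python's hash-based sets each individual lookup is closer to constant time on average, but the whole of check_file is still rescanned on every call).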

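Basically: as these structures grow (toward 1.3 million elements), your computer has to perform more and more computation. If you use check_file = set() instead of [] (list), the complexity should be about len(set_lst)+len(check_file).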
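As a rough sketch of that change (Python 3, not the asker's exact function), the comparer could track matched tuples in a set and shrink the candidate set in place. The names `ssdp_comparer`, `matched`, `candidates` and `toy_compare` are introduced here for illustration; `compare` stands in for s.compare, and the tuple layout (hash at index 0, file ID at index 2) is assumed from the question's code.

```python
def ssdp_comparer(lst, threshold, compare):
    """Variant of ssdpComparer that keeps matched tuples in a set.

    Each tuple is assumed to look like (hash, <unused>, file_id).
    """
    matched = set()        # replaces the check_file list
    candidates = set(lst)  # replaces set_lst
    result_data = []
    for tup1 in lst:
        if tup1 in matched:  # hash lookup instead of a list scan
            continue
        newly_matched = set()
        for tup2 in candidates:
            score = compare(tup1[0], tup2[0])
            if score >= threshold:
                # (Score, GroupID, FileID), as in the question
                result_data.append((score, tup1[2], tup2[2]))
                newly_matched.add(tup2)
        # Remove only what this pass matched, in place, without a new set.
        candidates -= newly_matched
        matched |= newly_matched
    remaining = set(lst) - matched
    return result_data, remaining


def toy_compare(a, b):
    """Made-up scorer: 100 if the first characters agree, otherwise 0."""
    return 100 if a[0] == b[0] else 0


files = [("aa1", None, 1), ("ab2", None, 2), ("zz3", None, 3)]
result, remaining = ssdp_comparer(files, 50, toy_compare)
print(sorted(result))
print(remaining)
```

The result is sorted before printing because iteration order over a set of tuples is not guaranteed.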

The same applies to checking whether an element is in a Python list (array):

if tup1 in check_file:

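Because check_file is a list, if tup1 is not in it your CPU has to compare tup1 against every element, so the complexity is len(check_file). If you change check_file to a set, a lookup costs at most about log2(len(check_file)) comparisons for a tree-based set, and Python's hash-based set is roughly constant time on average. To make the difference more tangible, suppose len(check_file) = 1 mln: how many comparisons do you need?

Set: at most log2(1 mln) = log2(1000000) ~ 20

List: len(check_file) = 1 mln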
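To see the gap on your own machine, a small timing sketch in Python 3 can help. The container size, the repeat count and the variable names are arbitrary choices for this illustration, and absolute timings will differ between machines.

```python
import timeit

n = 1000000
as_list = list(range(n))
as_set = set(as_list)
missing = -1  # not present, so the list has to be scanned to the end

# Time 20 membership tests against each container.
list_time = timeit.timeit(lambda: missing in as_list, number=20)
set_time = timeit.timeit(lambda: missing in as_set, number=20)
print("list: %.6f s, set: %.6f s" % (list_time, set_time))
```

Searching for a value that is absent is the slowest case for the list, which is also what happens in the question's code whenever tup1 has not been matched yet.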
