排序列表以最大化相邻项之间的相似性

2024-10-06 08:34:49 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一份钥匙清单:

['A', 'B', 'C']

对于每个键,都有一个属性列表:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

我希望对标签列表进行排序,以便相邻标签共享尽可能多的属性

在上面的示例中AB共享关系2,因此它们应该彼此相邻-而C与它们不共享任何内容,因此它可以去任何地方

因此,本例的可能顺序如下:

["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable

水桶

事实上,我更希望通过将它们放入“桶”来表示:

[["A", "B"], ["C"]] # this can represent all four possible orders above.

但是,如果标签属于两个不同的存储桶,则会出现问题:

{
    'A': [2,3],
    'B': [1,2],
    'C': [1,4]
}

我将如何陈述这一点

我可以这样说:

[["A", "B"], ["C", "B"]]

但我需要另一个处理步骤来翻转桶列表 在最后陈述中:

["A", "B", "C"]

除此之外,还可能存在递归嵌套的bucket:

[[["A","B"], ["C"]], ["D"]]

然后这些可能重叠:

[[["A","B"], ["C"]], ["A","D"]]

品质

“接近度”,即解的质量定义为相邻关系的交集之和(质量越高越好):

def measurequality(result,mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality

# Example determining that the solution ['A', 'B', 'C'] has quality 1:
#measurequality(['A', 'B', 'C'],
#    {
#        'A': [2,3],
#        'B': [1,2],
#        'C': [4]
#    })

暴力强迫

暴力强制并不构成解决方案(实际上,列表包含数千个元素——不过,如果有人得到了比O(n²)更好的暴力强制方法……)

但是,使用暴力强制创建额外的测试用例是可能的:

  • 生成Ln项的列表['A','B','C',...]
  • 为每个项目生成一个关系字典R(在0n之间最多有n个随机数就足够了)
  • 生成所有可能的L排列,并将它们与R一起馈送到measurequality()中,并保留那些具有最大返回值的排列(可能不是唯一的)

用于创建随机测试用例以测试实现的代码:

import string
import random

def randomtestcase(n):
    keys=list(string.ascii_uppercase[0:n])

    minq = 0
    maxq = 0
    while minq == maxq:
        items={}
        for key in keys:
            items[key] = random.sample(range(1,10),int(random.random()*10))

        minq = n*n
        minl = list(keys)
        maxq = 0
        maxl = list(keys)
        for _ in range(0, 1000): # TODO: explicitly construct all possible permutations of keys.
            random.shuffle(keys)
            q = measurequality(keys,items)
            if q < minq:
                minq = q
                minl = list(keys)
            if maxq < q:
                maxq = q
                maxl = list(keys)

    return ( items, minl, maxq )

( items, keys, quality ) = randomtestcase(5)

sortedkeys = dosomething( keys, items )
actualquality = measurequality( sortedkeys, items )
if actualquality < quality:
    print('Suboptimal: quality {0} < {1}'.format(actualquality,quality))

企图

许多“解决方案”中不起作用的一个(非常糟糕,这一个没有初始元素的选择/在结果列表的前置和追加之间的选择,我在其他列表中有):

def dosomething(keys,items):
    result = []
    todo = list(keys)
    result.append(todo.pop())
    while any(todo):
        lastItems = set(items[result[-1]])
        bestScore = None
        bestKey = None
        for key in todo:
            score = set(items[key]).intersection(lastItems)
            if bestScore is None or bestScore < score:
                bestScore = score
                bestKey = key
        todo.remove(bestKey)
        result.append(bestKey)
    return result

例子

还可以查看上文暴力强制部分中的示例生成器。

测试代码尝试以下示例:

def test(description,acceptable,keys,arguments):
    actual = dosomething(keys,arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys),description))
    print("Expected: {0}\nBut was: {1}".format(acceptable,actual))
    print("Quality of result: {0}".format(measurequality(actual,arguments)))
    print("Quality of expected: {0}".format([measurequality(a,arguments) for a in acceptable]))
    return 1

print("EXAMPLES")
failures = 0

# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentially already a valid ordering.
for keys in [
        ["A","B","C"],
        ["A","C","B"],
        ["B","A","C"],
        ["B","C","A"],
        ["C","A","B"],
        ["C","B","A"]
    ]:
    failures += test(
        "1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
        ["ABC", "BAC", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [4]
        })

    failures += test(
        "2. They all have 2, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [2]
        })

    failures += test(
        "3. A and B share 2, B and C share 1, so B must be in the middle.",
        ["ABC", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1]
        })

    failures += test(
        "4. Each shares something with each other, creating a cycle, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1,3]
        })

if 0 < failures:
    print("{0} FAILURES".format(failures))

优先权

正如有人问的那样:用于关系的数字没有优先顺序。存在一个优先顺序,但它是一个偏序,而不是一个数字。我只是没有提到它,因为它让问题变得更难

举个例子:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

可替换为以下内容(使用字母而不是数字,并添加优先级信息):

{
    'A': [('Q',7),('R',5)],
    'B': [('P',6),('Q',6)],
    'C': [('S',5)]
}

注意

  • 优先级仅在列表中有意义,而在列表之间没有意义
  • 两个列表之间共享关系的优先级可能不同
  • 在一个列表中,可能会多次出现相同的优先级

Tags: keyin列表forifitemsresultkeys
3条回答

这是一个Travelling Salesman Problem,一个众所周知的难以优化解决的问题。给出的代码在大约15分钟内解决了10000个简单互连节点(即每个节点有一个或两个关系)的问题。对于互连更为丰富的序列,它的性能较差。下面的测试结果对此进行了探讨

OP提到的优先权的概念没有被探讨

给出的代码包括一个heuristic解决方案、一个对大型node_set来说是最佳但不实用的蛮力解决方案,以及一些简单但可扩展的测试数据生成器,其中一些具有已知的最佳解决方案。启发式为OP的“ABC”示例(我自己的8项node_set)和已知最佳解决方案的可伸缩测试数据找到最佳解决方案

如果性能不够好,至少这是第一次尝试,并且开始了一个测试“研讨会”,以改进启发式

测试结果

>>> python sortbylinks.py

===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1

===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6

简短、相对琐碎的案例似乎没有问题

===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

这个"Quality <1000, cycling subsequence" (sequence length 501)很有趣。通过使用{0, 1}关系集对节点进行分组,质量分数几乎可以增加一倍。启发式算法找到这个最优序列。质量1000是不太可能的,因为这些双链接组需要每隔一段时间通过单个链接节点相互连接(例如{'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}

对于每个节点关系很少的测试数据,性能仍然很好

"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

旅行推销员问题(TSP)的困难之一是知道何时有一个最优解。即使从接近最优或最优的开始,启发式算法似乎也不会更快地收敛

===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000

当关系数量非常少时,即使有许多节点,性能也相当好,结果可能接近最优

===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?

===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?

这更像是OP生成的数据,也更像是经典的旅行商问题(TSP),其中每个城市对(对于“城市”读取“节点”)之间有一组距离,并且节点通常彼此连接紧密。在这种情况下,节点之间的链接是部分的,因此无法保证任何2个节点之间的链接

在这种情况下,启发式算法的时间性能要差得多。对于n节点,每个节点的随机关系介于0和n之间。这可能意味着更多的互换组合产生更好的质量,互换和质量检查本身成本更高,在启发式收敛到最佳结果之前需要更多的过程。在最坏的情况下,这可能意味着O(n^3)

随着节点和关系数量的增加,性能会下降(请注意n=2003分钟和n=50070分钟之间的差异)。因此,目前启发式算法可能不适用于数千个高度互连的节点

此外,由于暴力解决方案在计算上是不可行的,因此无法准确地知道该测试结果的质量6861 / 200 = 34.341999 / 500 = 84.0节点对之间的平均连接看起来并不太遥远

启发式和蛮力求解器代码

import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)


def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end-start).seconds))
        return f_return_value
    return wrap


def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)


class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(self.presented_sequence())] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: not relation in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq

    def quality(self, sequence):
        """
        calculate quality of full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by power of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # seed candidate may be better than its derived sequences   include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from beginning of sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from end of sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0

    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from present quality
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of pre-swapped sequence
        :return: swapped sequence, quality of swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence  sequence on which to perform the element swap
        :param node_pos1  zero-based position of first element
        :param pos2 : zero-based position of second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (which are dict keys and hence tuples)
            # needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links

    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
                self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]

代码术语表

node_set:一组带有标签的节点,其形式为'unique_node_id': {relation1, relation2, ..., relationN}。每个节点的关系集可以不包含任何关系,也可以包含任意数量的关系

node:由node_id(键)和一组关系(值)组成的键值对

relation:OP使用的是一个数字。如果两个节点都共享关系1,并且它们是序列中的邻居,则会将1添加到序列的质量中

sequence:与quality分数相关联的一组有序节点ID(例如,.[A]、[B]、[C'])。质量分数是序列中节点之间共享关系的总和。启发式的输出是具有最高质量分数的一个或多个序列

candidate:当前正在执行的序列调查看它是否是高质量的

方法

  1. 通过对链接项中是否存在每个关系进行稳定排序来生成种子序列

  2. 初始呈现的顺序也是种子序列之一,以防接近最优

  3. 对于每个种子序列,交换成对的节点以寻找更高的质量分数

  4. 对每个种子序列执行一轮。轮是一种类似于外壳排序的序列传递,首先在一定距离上交换节点对,然后缩小距离,直到距离为1(交换近邻)。仅保留质量高于当前最高质量分数的序列

  5. 如果在这一轮中发现了一个新的最高质量分数,则剔除除最高分数之外的所有候选分数,并使用最高分数作为种子重复4次。否则退出

测试和测试数据生成器

该启发式算法已经通过使用小型node_sets、数百到10000个节点的缩放数据(关系非常简单)和一个随机的、高度互联的node_set进行了测试,更像OP的测试数据生成器。完美的单链接序列、链接循环(内部链接和彼此链接的小子序列)和洗牌都有助于发现和修复弱点

ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)


def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)


def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model  otherwise the model could be exhausted
    # or start mid-cycle
    # https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}


def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}


cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)


tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]
for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)

演出

启发式比暴力解决方案更“实用”,因为它省去了许多可能的组合。可能是元素交换产生的低质量序列实际上距离一次交换后的高质量分数只有一步之遥,但这种情况可能会在测试之前被剔除

该启发式算法似乎可以找到单链序列或首尾相连的循环序列的最优结果。它们有一个已知的最佳解决方案,启发式算法可以找到该解决方案,并且它们可能比实际数据更简单,问题更少

一个巨大的改进是引入了“增量”质量计算,它可以快速计算两元素交换产生的质量差异,而无需重新计算整个序列的质量分数

我在修改你的测试程序,并提出了这个解决方案,它给了我0次失败。虽然感觉像是一种启发,但它确实需要更多的测试和测试用例。函数假定键是唯一的,因此['A', 'A', 'B', ...]列表和arguments字典中不存在所有元素:

def dosomething(_, arguments):

    m = {}
    for k, v in arguments.items():
        for i in v:
            m.setdefault(i, []).append(k)

    out, seen = [], set()
    for _, values in sorted(m.items(), key=lambda k: -len(k[1])):
        for v in values:
            if v not in seen:
                out.append(v)
                seen.add(v)

    return out

编辑:误读质量函数,这映射到可分离的旅行商问题

对于跨所有节点的N节点、P属性和T总属性,这应该能够在O(N + P + T)或更好的情况下解决,具体取决于数据的拓扑结构

让我们将问题转化为一个图,其中任意两个节点之间的“距离”为-(number of shared properties)。没有连接的节点将保持未链接状态。这将至少需要O(T)来创建图形,也许需要另一个O(N + P)来分段

然后通过节点将“排序顺序”转换为“路径”。特别是,您需要最短的路径

此外,您还可以应用多种翻译来提高通用算法的性能和可用性:

  • 将图形分割为不相连的块,并独立求解每个块
  • 重新规范化所有值,以从1..N开始,而不是从-N..1开始(每个图,但实际上并不重要,可以添加|number of properties|

https://en.wikipedia.org/wiki/Component_(graph_theory)#Algorithms

It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph).

https://en.wikipedia.org/wiki/Shortest_path_problem#Undirected_graphs

Weights  Time complexity   Author
ℝ+       O(V2)             Dijkstra 1959
ℝ+       O((E + V) log V)  Johnson 1977 (binary heap)
ℝ+       O(E + V log V)    Fredman & Tarjan 1984 (Fibonacci heap)
ℕ        O(E)              Thorup 1999 (requires constant-time multiplication).

相关问题 更多 >