Answers to: (Memory-)efficient implementation of "sorted" as a generator



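I suspect this is not a particularly new topic and that there are better implementations than mine. I am looking for (a) the type of algorithm I am dealing with, meaning its actual name or something close to it, and (b) possibly a better implementation.

The general problem: imagine a list <code>a</code> that is very long, too long to fit into memory more than once. The list contains a "random" sequence of things that can be ordered (<code><</code>, <code>></code> and <code>==</code> work). I want to iterate over all entries of the list in ascending order, duplicates included, without copying the list or generating anything else of similarly "extreme" length. I also want to keep the original order of the entries in <code>a</code>, which rules out an in-place <code>sort</code>. In short, I want to minimize the memory needed for sorting while leaving the original data source unmodified.

Python's <code>sorted</code> does not touch the original data, but it produces a new list of the same length as the original. My basic idea is therefore to re-implement <code>sorted</code> as a generator:

<pre class="lang-py prettyprint-override"><code>def sorted_nocopy_generator(data_list):
    state_max = max(data_list)
    state = min(data_list)
    state_count = data_list.count(state)
    for _ in range(state_count):
        yield state
    index = state_count
    while index < len(data_list):
        new_state = state_max
        for entry in data_list:
            if state < entry < new_state:
                new_state = entry
        state = new_state
        state_count = data_list.count(state)
        for _ in range(state_count):
            yield state
        index += state_count
</code></pre>

It can be tested as follows:

<pre class="lang-py prettyprint-override"><code>import random

a_min = 0
a_max = 10000
a = list(range(a_min, a_max))  # test data
a.extend((random.randint(a_min, a_max - 1) for _ in range(len(a) // 10)))  # some double entries
random.shuffle(a)  # "random" order
a_control = a.copy()  # for verification that a is not altered

a_test_sorted_nocopy_generator = list(sorted_nocopy_generator(a))
assert a == a_control
a_test_sorted = sorted(a)
assert a == a_control
assert a_test_sorted == a_test_sorted_nocopy_generator
</code></pre>

It scales as O(N^2), just like bubble sort. What kind of algorithm am I looking for? How can this be optimized, possibly by trading away some memory?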


1 answer

Anonymous, 1 day ago

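Sketching here. Where <code>N = len(data_list)</code> and <code>S = sqrt(N)</code>, this uses <code>O(S)</code> additional memory and takes worst-case <code>O(N*log(N))</code> time:

<ul>
<li>For each consecutive slice of length <code>S</code> in the original data, copy that slice to a temporary list, sort it in place with <code>.sort()</code>, then write the result to a unique temporary file. In all, there will be about <code>S</code> temporary files.</li>
<li>Feed those temporary files to <code>heapq.merge()</code>. That is a generator which only keeps track of the <code>S</code> current smallest values across its <code>S</code> inputs, so this part also has an <code>O(S)</code> memory burden.</li>
<li>Delete the temporary files.</li>
</ul>

The more memory you can use, the fewer temporary files are needed and the faster it will run.

<h2>Cutting the constant factor</h2>

As pointed out in the comments, there is scant hope for a sub-quadratic time algorithm. However, you can cut the constant factor in your original algorithm by reducing the number of passes over the data. Here is one way, which produces the next <code>K</code> entries on each pass over the data. It remains quadratic-time overall, though.

<pre><code>def sorted_nocopy_generator(data_list, K=100):
    import itertools
    from bisect import insort

    assert K >= 1
    total = 0          # number of entries yielded so far
    too_small = None   # largest entry already yielded
    while total < len(data_list):
        active = []  # hold the next K entries
        entry2count = {}
        for entry in data_list:
            if entry in entry2count:
                entry2count[entry] += 1
            elif ((too_small is None or too_small < entry)
                  and (len(active) < K or entry < active[-1])):
                insort(active, entry)
                entry2count[entry] = 1
                if len(active) > K:  # forget the largest
                    del entry2count[active.pop()]
        for entry in active:
            count = entry2count[entry]
            yield from itertools.repeat(entry, count)
            total += count
        too_small = active[-1]
</code></pre>

<h2>Eliminating worst cases</h2>

As in @btilly's answer, the worst cases in the code above can be averted by using a max heap. Adding a new entry to <code>active</code> then has worst-case time <code>O(log(K))</code> instead of <code>O(K)</code>.

Luckily, the <code>heapq</code> module already supplies something usable for this purpose. Dealing with duplicates becomes a real headache, though, because the max heap that <code>heapq</code> uses under the covers is not exposed.

So the following special-cases the largest of the smallest <code>K</code> entries of interest, using <code>.count()</code> (as in the original program) to make a full pass and count how many of them there are.

Rather than doing that for every unique element, it only needs to do so once per <code>K</code> elements.

Additional memory use is proportional to <code>K</code>.

<pre><code>def sorted_nocopy_generator(data_list, K=100):
    import itertools
    from heapq import nsmallest

    assert K >= 1
    too_small = None   # largest entry already yielded
    ntodo = len(data_list)
    while ntodo:
        if too_small is None:
            active = nsmallest(K, data_list)
        else:
            active = nsmallest(K, (x for x in data_list
                                     if x > too_small))
        too_small = active[-1]
        # Yield everything strictly below the largest active entry;
        # active already contains those duplicates.
        for x in active:
            if x == too_small:
                break
            yield x
            ntodo -= 1
        # The largest active entry may have duplicates that did not
        # fit into active, so count them with a full pass.
        count = data_list.count(too_small)
        yield from itertools.repeat(too_small, count)
        ntodo -= count
    assert ntodo >= 0
</code></pre>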
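
For completeness, the temporary-file approach from the start of this answer could be sketched roughly as follows. This is only a minimal illustration under some assumptions: the entries can be pickled, the function name <code>sorted_tempfile_generator</code>, the helper <code>_read_run</code> and the <code>chunk_size</code> parameter are made up for this example, and about <code>S</code> files are open at the same time during the merge, which may run into the operating system's open-file limit for very large inputs.

```python
import heapq
import math
import os
import pickle
import tempfile


def _read_run(path):
    """Yield the pickled entries of one sorted run file, one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def sorted_tempfile_generator(data_list, chunk_size=None):
    """Yield the entries of data_list in ascending order, leaving it unmodified.

    Hypothetical helper: sorts slices of about sqrt(N) entries, spills each
    sorted slice to its own temporary file, then merges the files lazily.
    """
    n = len(data_list)
    if chunk_size is None:
        chunk_size = max(1, math.isqrt(n))  # S = sqrt(N)
    paths = []
    try:
        for start in range(0, n, chunk_size):
            chunk = data_list[start:start + chunk_size]  # copies one slice only
            chunk.sort()
            fd, path = tempfile.mkstemp()
            paths.append(path)
            with os.fdopen(fd, "wb") as f:
                for item in chunk:
                    pickle.dump(item, f)
        # heapq.merge holds only one pending entry per run file in memory.
        yield from heapq.merge(*(_read_run(p) for p in paths))
    finally:
        # Clean up; if the generator is abandoned early, files may still be
        # open, which is fine on POSIX but can fail on Windows.
        for p in paths:
            os.remove(p)
```

It is a drop-in replacement for the test in the question, e.g. <code>list(sorted_tempfile_generator(a))</code>. Passing a larger <code>chunk_size</code> trades memory for fewer temporary files.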
