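<p>If I understand correctly, you need a sequence of non-repeating tuples drawn from a given range.</p>
<p><strong>EDIT</strong> <em>0</em>:</p>
<p>I believe your best bet is to first create all possible combinations and then shuffle them:</p>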
<pre><code>import itertools
import random


def random_unique_combinations_k0(items, k):
    # generate all possible combinations
    combinations = list(itertools.product(*items))
    # shuffle them
    random.shuffle(combinations)
    # yield the first k
    for combination in itertools.islice(combinations, k):
        yield combination
</code></pre>
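<p><strong>EDIT</strong> <em>1</em>:</p>
<p>If generating all combinations is too expensive in terms of memory, you may want to go by trial and error and reject non-unique combinations.
One way of doing it is:</p>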
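<p>(The code listing for this method, called <code>random_unique_combinations_k1</code> in the timings below, is missing from this copy.)</p>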
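<p>A minimal sketch of such a rejection-based approach, assuming (from the note that follows) that it collects tuples in a <code>set</code> and shuffles the result afterwards. The body is a reconstruction and not the original code; only the function name comes from the timings further down, and the guard against requesting too many tuples is an addition of this sketch.</p>

```python
import math
import random


def random_unique_combinations_k1(items, k):
    """Hedged reconstruction: draw random tuples, rejecting duplicates via a set."""
    items = [list(item) for item in items]
    # Guard added for this sketch: without it the loop below would never end
    # when more tuples are requested than exist.
    if k > math.prod(len(item) for item in items):
        raise ValueError("k exceeds the number of distinct combinations")
    combinations = set()
    while len(combinations) < k:
        # set.add() silently ignores a tuple that was already drawn
        combinations.add(tuple(random.choice(item) for item in items))
    # set iteration order is arbitrary but not random, hence the shuffle
    combinations = list(combinations)
    random.shuffle(combinations)
    yield from combinations
```

<p>The closer <code>k</code> gets to the total number of combinations, the more draws are rejected, so this is best suited to cases where <code>k</code> is small relative to the full product.</p>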
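<p>(This could also be implemented with a list, checking for uniqueness before adding each <code>combination</code>, which would also make the <code>random.shuffle()</code> superfluous, but in my tests this was significantly slower than using <code>set</code>s.)</p>
<p><strong>EDIT</strong> <em>2</em>:</p>
<p>Probably the least memory-hungry approach is to actually shuffle the generators and then use <code>itertools.product()</code> on them.</p>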
<pre><code>import random
import itertools


def pseudo_random_unique_combinations_k(items, k):
    # copy each input into its own list and shuffle it in place
    comb_gens = [list(item) for item in items]
    for comb_gen in comb_gens:
        random.shuffle(comb_gen)
    # get the first `k` combinations
    combinations = list(itertools.islice(itertools.product(*comb_gens), k))
    random.shuffle(combinations)
    for combination in combinations:
        yield tuple(combination)
</code></pre>
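<p>This obviously sacrifices some randomness: <code>itertools.product()</code> varies the last position fastest, so taking only the first <code>k</code> tuples leaves the leading positions with few distinct values (possibly just one).</p>
<p><strong>EDIT</strong> <em>3</em>:</p>
<p>Following @Divakar's approach, I have written another version which seems to be relatively efficient, but it will likely be limited by the capabilities of <code>random.sample()</code>.</p>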
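<p>To make the loss of randomness concrete, the sketch below counts how many distinct values show up at each tuple position when only the first <code>k</code> tuples of a product over shuffled inputs are kept. The helper name <code>first_k_of_shuffled_product</code> is made up for this illustration.</p>

```python
import itertools
import random


def first_k_of_shuffled_product(items, k):
    # shuffle a private copy of every input, then take the first k tuples
    pools = [list(item) for item in items]
    for pool in pools:
        random.shuffle(pool)
    return list(itertools.islice(itertools.product(*pools), k))


sample = first_k_of_shuffled_product([range(50)] * 5, 1000)
# number of distinct values observed at each of the 5 positions
distinct_per_position = [len(set(column)) for column in zip(*sample)]
print(distinct_per_position)  # [1, 1, 1, 20, 50]
```

<p>With 1000 tuples out of 50<sup>5</sup>, the first three positions never change, so the tuples are unique but far from uniformly spread over the full product.</p>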
<pre><code>import random
import functools


def prod(items):
    # product of all numbers in `items`
    return functools.reduce(lambda x, y: x * y, items, 1)


def random_unique_combinations_k3(items, k):
    # number of choices available at each position
    max_lens = [len(list(item)) for item in items]
    max_num_combinations = prod(max_lens)
    # draw k distinct flat indices, then decode each into one index per position
    for i in random.sample(range(max_num_combinations), k):
        index_combination = []
        for max_len in max_lens:
            index_combination.append(i % max_len)
            i = i // max_len
        yield tuple(item[j] for j, item in zip(index_combination, items))
</code></pre>
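<p>The inner loop is a mixed-radix decoding: each flat index between 0 and the total number of combinations corresponds to exactly one tuple of per-position indices, which is why sampling distinct flat indices yields distinct tuples. A small worked example, with <code>decode_index</code> as a hypothetical helper that isolates that step:</p>

```python
def decode_index(i, lens):
    # peel off one "digit" per position, each with its own base
    digits = []
    for n in lens:
        digits.append(i % n)
        i //= n
    return tuple(digits)


lens = [2, 3, 4]
# 17 % 2 = 1, 17 // 2 = 8;  8 % 3 = 2, 8 // 3 = 2;  2 % 4 = 2
print(decode_index(17, lens))  # (1, 2, 2)

# every flat index in range(2 * 3 * 4) decodes to a different tuple
decoded = {decode_index(i, lens) for i in range(2 * 3 * 4)}
print(len(decoded))  # 24
```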
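<p><strong>TESTING</strong></p>
<p>On the requested input, all of them run reasonably fast. In the timings below, the <code>2</code> (<code>pseudo</code>) method is the fastest and the <code>1</code> method is the slowest, with the <code>0</code> and <code>3</code> methods in between.
The <code>sklearn.model_selection.ParameterSampler</code> method is comparable in speed to the <code>1</code> method.</p>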
<pre><code>items = [v for k, v in hyperparams.items()]
num = 100
%timeit list(random_unique_combinations_k0(items, num))
615 µs ± 4.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k1(items, num))
2.51 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
179 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
570 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# the `sklearn` method, which is slightly different in that it
# also accesses the underlying dictionary
from sklearn.model_selection import ParameterSampler
%timeit list(ParameterSampler(hyperparams, n_iter=num))
2.86 ms ± 171 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
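<p>As a side note, I would make sure your <code>hyperparams</code> is a <code>collections.OrderedDict</code>, because a plain <code>dict</code> is not guaranteed to keep its order across different versions of Python (insertion order is only guaranteed from Python 3.7 onward).</p>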
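<p>The ordering matters because the sampled tuples carry no parameter names, so they have to be matched back to the keys by position. A minimal sketch of that mapping follows; the search space and the helper <code>as_params</code> are invented for illustration and are not taken from the question.</p>

```python
from collections import OrderedDict

# Hypothetical search space; the names and values are illustrative only.
hyperparams = OrderedDict([
    ("learning_rate", [0.01, 0.1, 1.0]),
    ("max_depth", [2, 4, 8]),
    ("subsample", [0.5, 1.0]),
])

names = list(hyperparams.keys())
items = list(hyperparams.values())


def as_params(combination):
    # pair each sampled value with its parameter name, by position
    return OrderedDict(zip(names, combination))


params = as_params((0.1, 4, 0.5))
print(dict(params))
```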
<p>For slightly larger inputs, we start to see the limits:</p>
<pre><code>items = [range(50)] * 5
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
19.3 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
1.82 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(random_unique_combinations_k3(items, num))
2.31 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
</code></pre>
<p>Even more so for larger inputs:</p>
<pre><code>items = [range(50)] * 50
num = 1000
%timeit list(random_unique_combinations_k0(items, num))
# Memory Error
%timeit list(random_unique_combinations_k1(items, num))
149 ms ± 3.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(pseudo_random_unique_combinations_k(items, num))
4.92 ms ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(random_unique_combinations_k3(items, num))
# OverflowError
</code></pre>
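<p>The <code>OverflowError</code> arises because <code>random.sample()</code> needs the length of the <code>range</code>, and 50<sup>50</sup> does not fit in a machine-sized integer. One possible workaround, sketched here as an assumption and not as part of the methods timed above, is to draw flat indices with <code>random.randrange()</code>, which accepts arbitrarily large integers, and to reject repeats with a <code>set</code>. The name <code>random_unique_combinations_big</code> is hypothetical.</p>

```python
import math
import random


def random_unique_combinations_big(items, k):
    """Hypothetical variant of method 3 that avoids random.sample(range(...))."""
    items = [list(item) for item in items]
    max_lens = [len(item) for item in items]
    total = math.prod(max_lens)
    if k > total:
        raise ValueError("k exceeds the number of distinct combinations")
    seen = set()
    while len(seen) < k:
        # randrange works with arbitrarily large Python ints
        i = random.randrange(total)
        if i in seen:
            continue  # reject a flat index that was already used
        seen.add(i)
        # same mixed-radix decoding as in method 3
        index_combination = []
        for max_len in max_lens:
            index_combination.append(i % max_len)
            i //= max_len
        yield tuple(item[j] for j, item in zip(index_combination, items))
```

<p>Like method <code>1</code>, this slows down as <code>k</code> approaches the total number of combinations, since more and more draws are rejected.</p>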
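<p><strong>SUMMARY</strong>:</p>
<p>Method <code>0</code> may not fit in memory; method <code>1</code> is the slowest but probably the most robust; method <code>3</code> gives the best performance among the methods that sample uniformly, as long as it does not run into overflow issues; and method <code>2</code> (<code>pseudo</code>) is the fastest and least memory-hungry, but produces somewhat "less random" combinations.</p>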