<p>Improving on my previous answer, this algorithm solves the problem in <code>O(n + m)</code>, where <code>n</code> is the length of the total range and <code>m</code> is the number of subranges.</p>
<p>The basic idea is to sweep over the <code>n</code> numbers only once, keeping a counter of how many ranges the current number belongs to. At every step we check whether we have passed a range start, in which case the counter is incremented. Conversely, whenever we have passed a range stop, the counter is decremented.</p>
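<p>In plain Python the sweep might be sketched as follows (<code>count_overlaps</code> is a hypothetical helper name and the two ranges are a toy input, neither is part of the answer above):</p>

```python
from collections import defaultdict

def count_overlaps(ranges):
    # Record a +1 event at each range start and a -1 event
    # at each (exclusive) range stop
    events = defaultdict(int)
    for r in ranges:
        events[r.start] += 1
        events[r.stop] -= 1
    # Walk the integers once, accumulating the counter
    counts = {}
    counter = 0
    for n in range(min(events), max(events)):
        counter += events[n]
        counts[n] = counter
    return counts

counts = count_overlaps([range(0, 3), range(1, 5)])
# counts == {0: 1, 1: 2, 2: 2, 3: 1, 4: 1}
```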
<p>The actual implementation below uses <code>numpy</code> and <code>pandas</code> for all the heavy lifting, so the iterative nature of the algorithm may not be apparent, but it is essentially just a vectorized version of what I described.</p>
<p>Compared to the 600 ms of my previous answer, the time for 10k ranges drops to 20 ms on my laptop. Moreover, memory usage here is also <code>O(n + m)</code>, whereas there it was <code>O(nm)</code>, so much larger values of <code>n</code> and <code>m</code> become feasible. You should probably use this solution instead of the first version.</p>
<pre class="lang-py prettyprint-override"><code>from collections import defaultdict

import numpy as np
import pandas as pd

# Generate data
def generate_ranges(n):
    boundaries = np.random.randint(-10_000, 10_000, size=(n, 2))
    boundaries.sort(axis=1)
    return [range(x, y) for x, y in boundaries]

ranges = generate_ranges(10_000)

# Extract boundaries (avoid shadowing the builtin `range`)
boundaries = np.array([[r.start, r.stop] for r in ranges])

# Add a +1 offset for range starts and a -1 offset for range stops
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1)
boundaries = boundaries.reshape(-1, 2)

# Compute range counts at each crossing of a range boundary; taking the
# last cumulative value per boundary accounts for all starts and stops
# that fall on the same number
df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].last()

# Expand to all integers by joining and forward-filling NaN
index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).ffill()

# Finally wrap the result in a defaultdict
d = defaultdict(int, df["count"].astype(int).to_dict())
</code></pre>
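<p>As a quick end-to-end sanity check, here is the same pipeline run on a tiny fixed input (the two ranges are my own toy example, not the benchmark data above):</p>

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Toy input: 1 and 2 lie in both ranges, 0 only in the first,
# 3 and 4 only in the second
ranges = [range(0, 3), range(1, 5)]

# Same pipeline as above, just on two ranges
boundaries = np.array([[r.start, r.stop] for r in ranges])
offsets = np.array([1, -1])[None, :].repeat(boundaries.shape[0], axis=0)
boundaries = np.stack([boundaries, offsets], axis=-1).reshape(-1, 2)

df = pd.DataFrame(boundaries, columns=["n", "offset"])
df = df.sort_values("n")
df["count"] = df["offset"].cumsum()
df = df.groupby("n")["count"].last()

index = pd.RangeIndex(df.index[0], df.index[-1] + 1)
df = pd.DataFrame(index=index).join(df).ffill()
d = defaultdict(int, df["count"].astype(int).to_dict())

print(d[1])   # -> 2, covered by both ranges
print(d[4])   # -> 1, covered only by range(1, 5)
print(d[99])  # -> 0, numbers outside every range default to 0
```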