<p>首先,我们引入一个唯一的<code>ID</code>,并使用<a href="https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Interval.html" rel="nofollow noreferrer">^{<cd2>}</a>:</p>
<pre><code>df['ID'] = range(df.shape[0])
df['Interval'] = df.apply(lambda x: pd.Interval(x['from'], x['to'], closed='both'), axis=1)
</code></pre>
<p>在此之后,我们将df与自身结合并计算重叠部分:</p>
^{pr2}$
<p>现在我们知道哪些id重叠,但不知道哪个id构建了一个连接的组件。一般来说,这不能通过像重新排序这样的简单算法来完成,但是位<a href="https://en.wikipedia.org/wiki/Component_(graph_theory)" rel="nofollow noreferrer">graph theory</a>有帮助。所以我们建立了一个图</p>
<p><code>graph = connected.groupby(['id', 'ID_x']).agg(list)</code></p>
<p>通过<a href="https://en.wikipedia.org/wiki/Depth-first_search" rel="nofollow noreferrer">depth first search</a>计算连通分量</p>
<pre><code>def connections(graph, id):
def dict_to_df(d):
df = pd.DataFrame(data=[d.keys(), d.values()], index=['ID', 'Subgraph']).T
df['id'] = id
return df[['id', 'Subgraph', 'ID']]
def dfs(node, num):
visited[node] = num
for _node in graph.loc[node].iloc[0]:
if _node not in visited:
dfs(_node, num)
visited = {}
graph = graph.loc[id]
for (num, node) in enumerate(graph.index):
if node not in visited:
dfs(node, num)
return dict_to_df(visited)
dfs = []
for id in graph.index.get_level_values(0).unique():
dfs.append(connections(graph, id))
conns = pd.concat(dfs)
</code></pre>
<p><code>conns</code>包含连接的组件,我们可以将它们组合在一起:</p>
<p><code>data = df.merge(conns[['Subgraph', 'ID']], on=['ID'])</code></p>
<p>最后一个任务是选择要保留的行:</p>
<pre><code>def select_min(x):
m = x['value'].min()
if len(x) > 1 and (x['value'] == m).all():
return -1
else:
return x['value'].idxmin()
selected = data.groupby(['id', 'Subgraph'])['value', 'ID'].apply(select_min)
selected = selected[selected >= 0]
</code></pre>
<p>现在我们完成了:</p>
<pre><code>print(df.loc[df.ID.isin(selected), :].drop(columns=['ID', 'Interval']))
id hit from to value
0 A hit1 56 102 8.500000e-04
3 C hit4 332 480 3.400000e-15
4 D hit5 291 512 3.800000e-24
9 D hit10 514 600 2.100000e-03
</code></pre>