Filtering a list of integer ranges by maximum overlap amount and overlap percentage

Published 2024-10-02 08:27:12


I have a list of ranges, for example:

[12-48,40-80,60-105,110-130,75-400]

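I need to filter out, or delete, ranges that overlap another range by more than x positions (for example, more than 10) and/or by more than x% (say 20%) of the smaller of the two ranges being compared.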
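As a minimal sketch of these two measures, assuming each range is a `(start, stop)` tuple whose length is `stop - start` (the helper name `overlap_stats` is hypothetical and not part of my code):

```python
def overlap_stats(a, b):
    """Return (overlap, fraction) for two (start, stop) ranges.

    overlap  -- number of shared positions, 0 if the ranges are disjoint
    fraction -- overlap divided by the length of the shorter range
    """
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    # Assumption: a zero-length range is treated as having no overlap.
    fraction = overlap / shorter if shorter else 0.0
    return overlap, fraction


print(overlap_stats((40, 80), (60, 105)))  # (20, 0.5)
print(overlap_stats((12, 48), (40, 80)))   # overlap of 8, about 22% of the shorter range
```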

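Currently I use a for loop that checks one range at a time, compares it with the next range to see whether they overlap by more than the limits I set, and deletes it if so. For the example above this does not give what I want; I get the following result: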

[12-48,75-400]

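The range [40-80] should not have been removed, because it does not overlap either of the two remaining ranges beyond the limits. It was removed because it overlaps [60-105] and is the smaller of the two. The correct remaining ranges should be: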

[12-48,40-80,75-400]
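To make the expected result concrete, the sketch below lists every pair of ranges that exceeds the limits. It assumes that a pair only counts as conflicting when it exceeds both limits (more than 10 positions and more than 20% of the shorter range); under an "or" reading, [12-48] and [40-80] would also conflict. The names `conflicts`, `MAX_OVERLAP` and `MAX_FRACTION` are hypothetical.

```python
from itertools import combinations

MAX_OVERLAP = 10     # assumed limit on overlapping positions
MAX_FRACTION = 0.2   # assumed limit as a fraction of the shorter range

ranges = [(12, 48), (40, 80), (60, 105), (110, 130), (75, 400)]


def conflicts(a, b):
    """True if a and b overlap by more than both limits (assumed 'and' rule)."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    shorter = min(a[1] - a[0], b[1] - b[0])
    return overlap > MAX_OVERLAP and shorter > 0 and overlap / shorter > MAX_FRACTION


# Check every pair, not only neighbouring ones.
pairs = [(a, b) for a, b in combinations(ranges, 2) if conflicts(a, b)]
for a, b in pairs:
    print(a, b)
# (40, 80) (60, 105)
# (60, 105) (75, 400)
# (110, 130) (75, 400)
```

Under this assumption, removing (60, 105) and (110, 130) resolves every conflict, which leaves exactly the three ranges listed above.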

I don't think a simple for loop is the solution here, but I'm at a loss. Please let me know if anything is unclear.

Current code

The parts involving GeneA/GenePrev/Gene_AND are how I compute the overlap percentage and can be ignored.

        start = int(key.split(',')[0])
        stop = int(key.split(',')[1])
        length = stop - start
        if First == True:
            Both_Frames[key] = value
            First = False
            GeneA[start:stop] = [1] * (stop - start)
            GenePrev = GeneA
            PrevStart = start
            PrevStop = stop
            prevlength = PrevStop - PrevStart
        else:
            GeneA[start:stop] = [1] * (stop - start)
            Gene_AND = GenePrev & GeneA

            if start == PrevStart:
                GenePrev = GeneA

                ######Need to delete item from dictionary which is overlapping
                Both_Frames.popitem(last=False)
                Both_Frames[key] = value
                PrevStart = start
                PrevStop = stop
                prevlength = PrevStop - PrevStart
            elif start >= PrevStart and stop <= PrevStop:

                continue
            elif  np.count_nonzero(Gene_AND) <= (length * OverLapPercentage) and np.count_nonzero(Gene_AND) <= OverLapNT:
                GenePrev = GeneA
                Both_Frames[key] = value
                PrevStart = start
                PrevStop = stop
                prevlength = PrevStop - PrevStart

            elif np.count_nonzero(Gene_AND) >= (length * OverLapPercentage) or np.count_nonzero(Gene_AND) >= OverLapNT:
                if length > prevlength:
                    GenePrev = GeneA

                    Both_Frames.popitem(last=False)
                    Both_Frames[key] = value
                    PrevStart = start
                    PrevStop = stop
                    prevlength = PrevStop - PrevStart

1 answer
Answer 1 · posted 2024-10-02 08:27:12

Here is one, somewhat involved, solution:

First, I convert your ranges into a list of tuples of ints:

import pandas as pd


r = ["12-48", "40-80", "60-105", "110-130", "75-400"]
r = [tuple(map(int, z.split("-"))) for z in r]

# [(12, 48), (40, 80), (60, 105), (110, 130), (75, 400)]

Next, I iterate over all the ranges and remove every range that is completely enclosed by another range. For example, (110, 130) lies within (75, 400):

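hold = []  # indices of ranges that are enclosed by some other range
for idx1 in range(len(r)):
    start_1, stop_1 = r[idx1]
    for idx2, (start_2, stop_2) in enumerate(r):
        if idx1 == idx2:
            continue
        if start_2 < start_1 and stop_1 < stop_2:
            hold.append(idx1)
            # Record each index only once, even if several ranges enclose it.
            break

# Delete from the highest index down so the remaining indices stay valid.
while hold:
    del r[hold.pop()]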

# [(12, 48), (40, 80), (60, 105), (75, 400)]

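Finally, I use a pandas.DataFrame to compute the overlap and the percentage overlap between each range and the next one by start position, and flag the rows that meet the exclusion criteria (overlap > 10 and percentage > 0.2). I then drop those rows in reverse order, recomputing the overlaps after each drop, until no more rows can be dropped.

The DataFrame is then converted back to a list of strings in the same format as the input.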

df = pd.DataFrame(r, columns=["start", "stop"]).sort_values("start")

df["length"] = df["stop"] - df["start"]
df["bool_1"], df["bool_2"] = True, True

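while (df["bool_1"] & df["bool_2"]).any():
    # Overlap of each range with the next one by start; NaN for the last row.
    df["overlap"] = df["stop"] - df["start"].shift(-1)
    # Overlap as a fraction of the current row's own length.
    df["pc"] = df["overlap"] / df["length"]

    df["bool_1"] = df["overlap"] > 10
    df["bool_2"] = df["pc"] > 0.2
    # Drop at most one flagged row per pass, starting from the highest index,
    # then recompute the overlaps on the next pass.
    for i, row in df.sort_index(ascending=False).iterrows():
        if row["bool_1"] and row["bool_2"]:
            df.drop(i, inplace=True)
            break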

result = df["start"].astype("str").str.cat(df["stop"].astype("str"), sep="-").to_list()

# ['12-48', '40-80', '75-400']
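For comparison, the same steps can be sketched without pandas. This is only an approximation of the approach above, not an exact port: the function name `filter_ranges` and its parameters are hypothetical, it scans by sorted position rather than by DataFrame index, and like the code above it measures the percentage against the length of the earlier range in each neighbouring pair.

```python
def filter_ranges(ranges, max_overlap=10, max_fraction=0.2):
    """Filter "start-stop" strings; returns the kept ranges sorted by start."""
    spans = sorted(tuple(map(int, s.split("-"))) for s in ranges)

    # Step 1: drop ranges that lie strictly inside another range.
    spans = [
        (a, b) for (a, b) in spans
        if not any(c < a and b < d for (c, d) in spans)
    ]

    # Step 2: repeatedly drop the last range whose overlap with its
    # successor exceeds both limits, then rescan from the end.
    changed = True
    while changed:
        changed = False
        for i in range(len(spans) - 2, -1, -1):
            start, stop = spans[i]
            overlap = stop - spans[i + 1][0]
            # overlap > max_overlap (>= 0) implies stop > start, so no division by zero.
            if overlap > max_overlap and overlap / (stop - start) > max_fraction:
                del spans[i]
                changed = True
                break

    return [f"{a}-{b}" for a, b in spans]


print(filter_ranges(["12-48", "40-80", "60-105", "110-130", "75-400"]))
# ['12-48', '40-80', '75-400']
```

Whether dropping the earlier range of each offending pair is the right rule for every input is an open choice; it happens to reproduce the expected result for this example.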
