Removing rows based on multiple values in other rows

   length  qstart  qend  sstart  send
0    5464       1  5459       1  5460
1     400    3619  4015    4654  4258
2     396    4261  4653    4012  3619
3     203    1210  1411    1086  1287
4     203    5486  5689    5490  5693
5     100    5500  5600    5310  5410

import pandas as pd

df = pd.DataFrame({
    'length': {0: 5464, 1: 400, 2: 396, 3: 203, 4: 203, 5: 100},
    'qstart': {0: 1, 1: 3619, 2: 4261, 3: 1210, 4: 5486, 5: 5500},
    'qend': {0: 5459, 1: 4015, 2: 4653, 3: 1411, 4: 5689, 5: 5600},
    'sstart': {0: 1, 1: 4654, 2: 4012, 3: 1086, 4: 5490, 5: 5310},
    'send': {0: 5460, 1: 4258, 2: 3619, 3: 1287, 4: 5693, 5: 5410},
})

# Collect every row j whose (qstart, qend) range lies strictly inside that of a row i at or before it.
removeRows = []
for i in range(len(df.index) - 1):
    for j in range(i, len(df.index)):
        if df.iloc[j]['qstart'] > df.iloc[i]['qstart']:
            if df.iloc[j]['qend'] < df.iloc[i]['qend']:
                removeRows.append(j)

print(df[~df.index.isin(removeRows)])


3 Answers

User

#1 · Edited 2024-09-25 18:17:10

Solution 1

This produces the expected result, but its execution time is comparable to, or slower than, the two for loops.

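df['remove'] = False
for i in df.index:
    # Flag every not-yet-flagged row that lies strictly inside row i's (qstart, qend) range.
    inside = (~df['remove']) & (df['qstart'] > df.loc[i, 'qstart']) & (df['qend'] < df.loc[i, 'qend'])
    # Assign through a single .loc call; the chained df['remove'].loc[...] form may not write back to df.
    df.loc[inside, 'remove'] = True
ddf = df.loc[~df['remove']]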

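I first add a column named 'remove', with every element set to False, to keep track of the rows to delete.
The loop over the index then sets the elements of 'remove' to True according to your condition. This is done once for each row.
Finally, you can create a new DataFrame, ddf, by selecting all rows where 'remove' is False.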
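As a self-contained illustration of the flag-and-filter idea described above, here is a minimal sketch run on the question's sample data. The function name `drop_contained_rows` is hypothetical, and the sketch keeps the flags in a separate boolean Series rather than adding a 'remove' column, so the input DataFrame is left unchanged.

```python
import pandas as pd


def drop_contained_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return df without rows whose (qstart, qend) lies strictly inside another row's."""
    # One flag per row, aligned on the DataFrame's index (hypothetical helper state).
    remove = pd.Series(False, index=df.index)
    for i in df.index:
        # Rows that start after row i starts and end before row i ends.
        inside = (df["qstart"] > df.loc[i, "qstart"]) & (df["qend"] < df.loc[i, "qend"])
        remove |= inside
    return df.loc[~remove]


sample = pd.DataFrame({
    "length": [5464, 400, 396, 203, 203, 100],
    "qstart": [1, 3619, 4261, 1210, 5486, 5500],
    "qend": [5459, 4015, 4653, 1411, 5689, 5600],
    "sstart": [1, 4654, 4012, 1086, 5490, 5310],
    "send": [5460, 4258, 3619, 1287, 5693, 5410],
})

kept = drop_contained_rows(sample)
print(kept.index.tolist())  # [0, 4] for this sample
```

Rows 1, 2 and 3 fall inside row 0's range and row 5 falls inside row 4's, which is why only rows 0 and 4 survive.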

Solution 2

A similar but faster solution is to loop over combinations of rows:

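from itertools import combinations

df['remove'] = False
# Visit each pair of rows (i, j), with i before j, exactly once.
for i, j in combinations(df.index, 2):
    if not df.loc[j, 'remove']:
        # Flag row j if it lies strictly inside row i's (qstart, qend) range.
        df.loc[j, 'remove'] = df.loc[j, 'qstart'] > df.loc[i, 'qstart'] and df.loc[j, 'qend'] < df.loc[i, 'qend']
ddf = df.loc[~df['remove']]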

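This is conceptually similar, but here each pair of rows is visited only once, which speeds up execution. In Solution 1, every loc selection checks the whole DataFrame, so many of the comparisons are wasted.
According to my tests, this should be faster than the two for loops.

For both solutions, setting

pd.options.mode.chained_assignment = None

turns off pandas' chained-assignment warning check, which improves execution time.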
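The speed claim is easy to check on your own data. The following is a rough timing sketch, not the answerer's benchmark: the function names `with_nested_loops` and `with_combinations` are hypothetical wrappers around the question's double loop and the combinations loop, and the repeat count of 20 is an arbitrary assumption. No timings are quoted here because they depend on the machine, the pandas version and the size of the DataFrame.

```python
import timeit
from itertools import combinations

import pandas as pd


def with_nested_loops(df):
    # The question's approach: positional double loop over row pairs.
    remove_rows = []
    n = len(df.index)
    for i in range(n - 1):
        for j in range(i, n):
            if (df.iloc[j]["qstart"] > df.iloc[i]["qstart"]
                    and df.iloc[j]["qend"] < df.iloc[i]["qend"]):
                remove_rows.append(df.index[j])
    return df[~df.index.isin(remove_rows)]


def with_combinations(df):
    # Solution 2: each pair (i, j) with i before j is visited once.
    df = df.copy()  # work on a copy so the caller's frame is not modified
    df["remove"] = False
    for i, j in combinations(df.index, 2):
        if not df.loc[j, "remove"]:
            df.loc[j, "remove"] = bool(
                df.loc[j, "qstart"] > df.loc[i, "qstart"]
                and df.loc[j, "qend"] < df.loc[i, "qend"]
            )
    return df.loc[~df["remove"]].drop(columns="remove")


sample = pd.DataFrame({
    "length": [5464, 400, 396, 203, 203, 100],
    "qstart": [1, 3619, 4261, 1210, 5486, 5500],
    "qend": [5459, 4015, 4653, 1411, 5689, 5600],
    "sstart": [1, 4654, 4012, 1086, 5490, 5310],
    "send": [5460, 4258, 3619, 1287, 5693, 5410],
})

# Both variants should agree before their timings are compared.
assert with_nested_loops(sample).equals(with_combinations(sample))

for func in (with_nested_loops, with_combinations):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f"{func.__name__}: {seconds:.4f} s for 20 runs")
```

A six-row sample is too small to say much; timing a DataFrame of realistic size gives a more meaningful comparison.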

User

#2 · Edited 2024-09-25 18:17:10

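i = 0
# df shrinks inside the loop, so len(df) is re-evaluated on every pass.
while i < len(df):
    qstart = df['qstart'].iloc[i]
    qend = df['qend'].iloc[i]
    # Keep only the rows that are not strictly inside row i's (qstart, qend) range.
    df = df.query('qstart <= @qstart or qend >= @qend')
    i += 1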
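Since this answer comes without an explanation, here is a small worked sketch of the same loop on the question's sample data. The wrapper name `drop_contained_with_query` is hypothetical. Each pass keeps only the rows that are not strictly inside the current row's range, and because the filtered frame replaces `df`, rows that were already removed are never used as a reference later on.

```python
import pandas as pd


def drop_contained_with_query(df: pd.DataFrame) -> pd.DataFrame:
    i = 0
    while i < len(df):
        # Reference range taken from the i-th remaining row.
        qstart = df["qstart"].iloc[i]
        qend = df["qend"].iloc[i]
        # '@name' refers to the local variable, the bare name to the column.
        df = df.query("qstart <= @qstart or qend >= @qend")
        i += 1
    return df


sample = pd.DataFrame({
    "length": [5464, 400, 396, 203, 203, 100],
    "qstart": [1, 3619, 4261, 1210, 5486, 5500],
    "qend": [5459, 4015, 4653, 1411, 5689, 5600],
    "sstart": [1, 4654, 4012, 1086, 5490, 5310],
    "send": [5460, 4258, 3619, 1287, 5693, 5410],
})

result = drop_contained_with_query(sample)
print(result.index.tolist())  # [0, 4] for this sample
```

On this sample the first pass (row 0) removes rows 1, 2 and 3, the second pass (row 4) removes row 5, and the loop then stops because only two rows remain.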

User

#3 · Edited 2024-09-25 18:17:10

Another possible solution is to use

df.iterrows()

and implement an if statement that checks the required condition:

start = 0
end = 0
for idx, row in df.iterrows():
    next_start = row["qstart"]
    next_end = row["qend"]
    # Drop the row if it lies strictly inside the last kept row's range.
    if start < next_start and end > next_end:
        df.drop(idx, inplace=True)
    else:
        # Otherwise this row becomes the new reference range.
        start = next_start
        end = next_end

The result can then be sorted as expected with

df.sort_values(by="length")
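One property of this loop is worth spelling out: each row is compared only against the most recently kept row, so the outcome depends on the order of the rows. The sketch below is a variant, not the answerer's exact code: the hypothetical function `drop_contained_sequential` collects the labels to drop and removes them in one call instead of dropping inside the loop, then sorts by length.

```python
import pandas as pd


def drop_contained_sequential(df: pd.DataFrame) -> pd.DataFrame:
    start, end = 0, 0
    to_drop = []
    for label, row in df.iterrows():
        if start < row["qstart"] and end > row["qend"]:
            # Strictly inside the last kept row's range: mark for removal.
            to_drop.append(label)
        else:
            # This row becomes the new reference range.
            start, end = row["qstart"], row["qend"]
    return df.drop(index=to_drop)


sample = pd.DataFrame({
    "length": [5464, 400, 396, 203, 203, 100],
    "qstart": [1, 3619, 4261, 1210, 5486, 5500],
    "qend": [5459, 4015, 4653, 1411, 5689, 5600],
    "sstart": [1, 4654, 4012, 1086, 5490, 5310],
    "send": [5460, 4258, 3619, 1287, 5693, 5410],
})

filtered = drop_contained_sequential(sample)
by_length = filtered.sort_values(by="length")  # ascending by default, returns a new frame
print(filtered.index.tolist())   # [0, 4] for this sample
print(by_length.index.tolist())  # [4, 0]: shortest row first
```

Note that `sort_values` returns a new DataFrame rather than sorting in place, so its result has to be assigned to a name to be kept.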
