Select pd.DataFrame rows (for a specific column) that have the maximum intersection in terms of depth values

Published 2024-10-02 02:37:12


+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
|   PF1      | 2800| 3012.0 | AC3 |  CALI       |
|   PF2      | 3000| 3041.0 | AC4 |  CALI       |
+------------+-----+--------+-----+-------------+

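We need to drop rows that are duplicated in the "Global name" column, but in a specific way: among the duplicates, we want to keep the row that has the largest intersection with a reference range. That range runs from the maximum of the "Min" column to the minimum of the "Max" column, both computed over the non-duplicated rows. In the example above the range is [2600.0; 3041.0], so we want to keep only the row with ['Meth.name'] == 'DS', and the overall result is as follows: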

+------------+-----+--------+-----+-------------+
| Meth.name  |  Min| Max    |Layer| Global name |
+------------+-----+--------+-----+-------------+
|   DTS      | 2600| 3041.2 | AC1 |  DTS        |
|   GGK      | 1800| 3200.0 | AC1 |  DEN        |
|   DTP      | 700 | 3041.0 | AC2 |  DT         |
|   DS       | 700 | 3041.0 | AC3 |  CALI       |
+------------+-----+--------+-----+-------------+

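Of course, this can be solved with several iterations (compute the interval from the non-duplicated rows, then iteratively select, from the duplicated rows, only those that give the largest intersection), but I am trying to find the most efficient way. Thanks.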
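As a minimal sketch of how the reference range described above could be computed: the DataFrame name `df` and the underscore column names `Meth_name` and `Global_name` are assumptions (the table header uses a dot and a space), chosen to match the identifiers used in the answers.

```python
import pandas as pd

# Sample data copied from the table in the question.
# Column names with underscores are an assumption for attribute-friendly access.
df = pd.DataFrame({
    "Meth_name": ["DTS", "GGK", "DTP", "DS", "PF1", "PF2"],
    "Min": [2600, 1800, 700, 700, 2800, 3000],
    "Max": [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    "Layer": ["AC1", "AC1", "AC2", "AC3", "AC3", "AC4"],
    "Global_name": ["DTS", "DEN", "DT", "CALI", "CALI", "CALI"],
})

# Rows whose Global_name occurs exactly once define the reference range.
is_dup = df["Global_name"].duplicated(keep=False)
unique_rows = df[~is_dup]

range_low = unique_rows["Min"].max()   # largest Min among non-duplicated rows
range_high = unique_rows["Max"].min()  # smallest Max among non-duplicated rows

print(range_low, range_high)
```

For the sample table this gives the interval [2600; 3041.0] quoted in the question.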


2 Answers

If the order of the rows does not matter, you can do the following:

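# Width of each row's depth interval
df['diff'] = df['Max'] - df['Min']
# Within each Global_name, the widest interval ends up last
df = df.sort_values(["Global_name", "diff"], ascending=True)
# Keep the widest interval per Global_name (assign the result; drop_duplicates is not in-place)
df = df.drop_duplicates('Global_name', keep='last')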

Taken from this question.
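A small sketch of this approach run on the sample table from the question may help. The DataFrame construction and the `Meth_name` column name are assumptions; the final `sort_index()` is an optional addition that restores the original row order. Note that this ranks duplicates by interval width (`Max - Min`), which matches the largest overlap for this sample but need not in general: a wide interval lying entirely outside the reference range would still win.

```python
import pandas as pd

# Sample data copied from the question (underscore column names are assumed).
df = pd.DataFrame({
    "Meth_name": ["DTS", "GGK", "DTP", "DS", "PF1", "PF2"],
    "Min": [2600, 1800, 700, 700, 2800, 3000],
    "Max": [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    "Layer": ["AC1", "AC1", "AC2", "AC3", "AC3", "AC4"],
    "Global_name": ["DTS", "DEN", "DT", "CALI", "CALI", "CALI"],
})

widest = (
    df.assign(diff=df["Max"] - df["Min"])         # interval width per row
      .sort_values(["Global_name", "diff"])       # widest interval last in each group
      .drop_duplicates("Global_name", keep="last")
      .drop(columns="diff")
      .sort_index()                               # restore the original row order
)

print(widest["Meth_name"].tolist())
```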

I would do it like this:

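import pandas as pd

# Assumes df has columns Min, Max and Global_name (underscore instead of space)

# Names that occur more than once in Global_name
counts = df.Global_name.value_counts()
dup_global_name = list(counts[counts > 1].index)

# Reference range, computed from the non-duplicated rows
df_unique = df[~df.Global_name.isin(dup_global_name)]
max_of_min = df_unique.Min.max()
min_of_max = df_unique.Max.min()

# Helper function: length of the overlap between a row and the reference range
def calc_overlap(x):
    if min_of_max == max_of_min:
        return 0

    low = max(max_of_min, x.Min)
    high = min(min_of_max, x.Max)

    return high - low

# Filter duplicates (copy to avoid SettingWithCopyWarning)
df_dup = df[df.Global_name.isin(dup_global_name)].copy()

# Add overlap column
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)

# Select max overlap
df_dup = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()]

# Drop overlap column
df_dup = df_dup.drop('overlap', axis=1)

# Concatenate with the non-duplicated rows
result = pd.concat([df_unique, df_dup])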

This produces the desired output.
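Since the question asks for efficiency, here is a hedged sketch of the same idea without the row-wise `apply`, using `Series.clip` and a grouped `idxmax`. The function name `keep_max_overlap` and its parameters are hypothetical, and it assumes at least one non-duplicated row exists to define the reference range. Ties are resolved in favour of the first row, because `idxmax` returns the first maximum.

```python
import pandas as pd


def keep_max_overlap(df, key="Global_name", lo="Min", hi="Max"):
    """Hypothetical helper: per `key`, keep the row overlapping the reference range most."""
    dup = df[key].duplicated(keep=False)
    if dup.all():
        raise ValueError("no non-duplicated rows to define the reference range")

    # Reference range from the non-duplicated rows.
    ref_low = df.loc[~dup, lo].max()
    ref_high = df.loc[~dup, hi].min()

    # Vectorised overlap length, floored at 0 for rows outside the range.
    overlap = (df[hi].clip(upper=ref_high) - df[lo].clip(lower=ref_low)).clip(lower=0)

    # Index label of the best row in every group (singleton groups keep their only row).
    best = overlap.groupby(df[key]).idxmax()
    return df.loc[sorted(best)]


# Sample data copied from the question (underscore column names are assumed).
df = pd.DataFrame({
    "Meth_name": ["DTS", "GGK", "DTP", "DS", "PF1", "PF2"],
    "Min": [2600, 1800, 700, 700, 2800, 3000],
    "Max": [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    "Layer": ["AC1", "AC1", "AC2", "AC3", "AC3", "AC4"],
    "Global_name": ["DTS", "DEN", "DT", "CALI", "CALI", "CALI"],
})

result = keep_max_overlap(df)
print(result)
```

Sorting the selected index labels keeps the rows in their original order, so no extra concatenation step is needed.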
