Filter a numeric column by a dictionary of value ranges

Posted 2024-05-18 07:14:28


I have a DataFrame and a dictionary:

df = 

VARIABLE VALUE
A        3
A        4
A        60
A        5
B        1
B        2
B        3
B        100
C        0
C        1
# inclusive
accepted_ranges = {
    'A': [3, 5],
    'B': [1, 3],
}
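For anyone who wants to run the snippets on this page, the example data can be built roughly like this. It is a minimal sketch: the question only shows the frame printed, so the construction below (pandas imported as `pd`, string keys 'A' and 'B', integer VALUE column) is an assumption.

```python
import pandas as pd

# Example frame from the question: one row per (VARIABLE, VALUE) pair
df = pd.DataFrame({
    "VARIABLE": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "VALUE": [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})

# Inclusive [low, high] bounds per VARIABLE; C deliberately has no entry
accepted_ranges = {
    "A": [3, 5],
    "B": [1, 3],
}

print(df)
```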

I want to clean the VALUE column according to the accepted ranges in the dictionary:

df = 

VARIABLE VALUE
A        3
A        4
A        NaN
A        5
B        1
B        2
B        3
B        NaN
C        0
C        1

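What I have tried: using map(), but I can't find a way to use it per VARIABLE group. apply() works, but I believe it is very slow on my DataFrame (more than 100,000 rows). Thanks in advance.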


2 answers

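Use Series.map to map the dictionary onto each VARIABLE. Then use Series.between to check whether each VALUE lies within its range.

Finally, use Series.where to convert the values for which the check is False to NaN.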

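ranges = df['VARIABLE'].map(accepted_ranges)
in_range = df['VALUE'].between(ranges.str[0], ranges.str[1])
# keep rows whose VARIABLE has no entry in accepted_ranges (e.g. C)
df['VALUE'] = df['VALUE'].where(in_range | ranges.isna())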

  VARIABLE  VALUE
0        A    3.0
1        A    4.0
2        A    NaN
3        A    5.0
4        B    1.0
5        B    2.0
6        B    3.0
7        B    NaN
8        C    0.0
9        C    1.0
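A VARIABLE without an entry in accepted_ranges, such as C, maps to NaN, and between treats a NaN bound as out of range, so such rows have to be kept explicitly if they should pass through unchanged. The sketch below makes the intermediate values visible. The data construction and the names `lower`, `upper`, `in_range` and `no_range` are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "VARIABLE": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "VALUE": [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})
accepted_ranges = {"A": [3, 5], "B": [1, 3]}

# Each row gets the [low, high] list of its VARIABLE, or NaN if there is none
ranges = df["VARIABLE"].map(accepted_ranges)

lower = ranges.str[0]   # first element of each list, NaN where no list exists
upper = ranges.str[1]   # second element of each list

in_range = df["VALUE"].between(lower, upper)  # False whenever a bound is NaN
no_range = ranges.isna()                      # True for the C rows

print(pd.DataFrame({"VARIABLE": df["VARIABLE"], "VALUE": df["VALUE"],
                    "in_range": in_range, "no_range": no_range}))
```

Combining the two masks with `in_range | no_range` keeps the C rows while still masking 60 and 100.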

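Faster

The .str accessor can be quite slow, since it is mostly a loop implementation under the hood. With around 100,000 rows of data, this can make the code noticeably inefficient. We can work around it by splitting accepted_ranges into two dictionaries, so that Series.map gives us two vectors: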

accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

ranges1 = df['VARIABLE'].map(accepted_ranges1)
ranges2 = df['VARIABLE'].map(accepted_ranges2)

m1 = df['VALUE'].between(ranges1, ranges2)
m2 = ~df['VARIABLE'].isin(list(accepted_ranges.keys()))

df['VALUE'] = df['VALUE'].where(m1|m2)
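The same steps can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the function name `clean_by_ranges` and its `key` and `value` parameters are hypothetical, and it returns a new frame instead of modifying `df` in place.

```python
import pandas as pd


def clean_by_ranges(df, accepted_ranges, key="VARIABLE", value="VALUE"):
    """Return a copy of df with out-of-range entries of `value` set to NaN."""
    # Split {name: [low, high]} into two flat dictionaries
    lows = {k: bounds[0] for k, bounds in accepted_ranges.items()}
    highs = {k: bounds[1] for k, bounds in accepted_ranges.items()}

    # One lower and one upper bound per row (NaN where no range is defined)
    lower = df[key].map(lows)
    upper = df[key].map(highs)

    in_range = df[value].between(lower, upper)       # inclusive on both ends
    unlisted = ~df[key].isin(list(accepted_ranges))  # no range given: keep

    return df.assign(**{value: df[value].where(in_range | unlisted)})


df = pd.DataFrame({
    "VARIABLE": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "VALUE": [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})
cleaned = clean_by_ranges(df, {"A": [3, 5], "B": [1, 3]})
print(cleaned)
```

Returning a new frame through `assign` leaves the original data available for comparison; assigning back to `df['VALUE']` as in the answer works just as well.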

Speed comparison

# create example dataframe of 10m rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape

# (10000000, 2)

# Erfan 1
%%timeit
ranges = dfbig['VARIABLE'].map(accepted_ranges)
dfbig['VALUE'].where(dfbig['VALUE'].between(ranges.str[0], ranges.str[1]))

10 s ± 466 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Erfan 2
%%timeit
accepted_ranges1 = {k: v[0] for k, v in accepted_ranges.items()}
accepted_ranges2 = {k: v[1] for k, v in accepted_ranges.items()}

ranges1 = dfbig['VARIABLE'].map(accepted_ranges1)
ranges2 = dfbig['VARIABLE'].map(accepted_ranges2)

dfbig['VALUE'].where(dfbig['VALUE'].between(ranges1, ranges2))

1.03 s ± 22.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# piRSquared
%%timeit
mask = [
    k not in accepted_ranges
    or accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
    for k, v in zip(dfbig.VARIABLE, dfbig.VALUE)
]

dfbig.VALUE.where(mask)

3.11 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
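The %%timeit cells above only run inside IPython or Jupyter. To repeat the comparison in a plain script, something along these lines may work. It is a rough sketch: the function names are hypothetical, the frame is much smaller than the 10 million rows used above so that it finishes quickly, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    "VARIABLE": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C"],
    "VALUE": [3, 4, 60, 5, 1, 2, 3, 100, 0, 1],
})
accepted_ranges = {"A": [3, 5], "B": [1, 3]}

# 100,000 rows instead of 10 million, to keep the run short
dfbig = pd.concat([df] * 10_000, ignore_index=True)


def with_str_accessor():
    ranges = dfbig["VARIABLE"].map(accepted_ranges)
    return dfbig["VALUE"].where(
        dfbig["VALUE"].between(ranges.str[0], ranges.str[1])
    )


def with_two_maps():
    lows = {k: v[0] for k, v in accepted_ranges.items()}
    highs = {k: v[1] for k, v in accepted_ranges.items()}
    lower = dfbig["VARIABLE"].map(lows)
    upper = dfbig["VARIABLE"].map(highs)
    return dfbig["VALUE"].where(dfbig["VALUE"].between(lower, upper))


timings = {}
for func in (with_str_accessor, with_two_maps):
    # Best of 3 runs, one call per run
    timings[func.__name__] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s")
```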

Boolean mask

mask = [
    # keep rows whose VARIABLE has no accepted range (e.g. C)
    k not in accepted_ranges
    or accepted_ranges[k][0] <= v <= accepted_ranges[k][1]
    for k, v in zip(df.VARIABLE, df.VALUE)
]

df[mask]

  VARIABLE  VALUE
0        A      3
1        A      4
3        A      5
4        B      1
5        B      2
6        B      3
8        C      0
9        C      1

df.assign(VALUE=df.VALUE.where(mask))

  VARIABLE  VALUE
0        A    3.0
1        A    4.0
2        A    NaN
3        A    5.0
4        B    1.0
5        B    2.0
6        B    3.0
7        B    NaN
8        C    0.0
9        C    1.0
