Python: Replacing outliers with the mean

import pandas as pd

df = pd.read_excel("test.xlsx")
grouped = df.groupby('ID')

# Per-ID quartiles and median of Value
statBefore = pd.DataFrame({'q1': grouped['Value'].quantile(.25),
                           'median': grouped['Value'].median(),
                           'q3': grouped['Value'].quantile(.75)})

def is_outlier(row):
    q1 = statBefore.loc[row.ID]['q1']
    q3 = statBefore.loc[row.ID]['q3']
    iq_range = q3 - q1
    # Outlier: more than 3 interquartile ranges beyond either quartile
    return row.Value > (q3 + (3 * iq_range)) or row.Value < (q1 - (3 * iq_range))

# apply the function to the original df:
df.loc[:, 'outlier'] = df.apply(is_outlier, axis=1)

2 answers

User

#1 · edited 2024-10-01 17:30:01

That is simply how quantiles are defined:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([60, 70, 80, 100, 130, 150, 200, 200, 250, 300, 800, 1000]))
print(df.quantile(.25))
print(df.quantile(.50))
print(df.quantile(.75))

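(By the way, q1 for this dataset is 95.)

The median lies halfway between 150 and 200 (175).

The first quartile lies three quarters of the way from 80 to 100 (95).

The third quartile lies one quarter of the way from 250 to 300 (262.5).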
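The three values above follow from linear interpolation between neighbouring sorted values, which is the default behaviour of quantile() in pandas. Here is a minimal sketch of that arithmetic. Note that linear_quantile is a hypothetical helper written only for illustration; it is not a pandas or NumPy function.

```python
import numpy as np

data = np.array([60, 70, 80, 100, 130, 150, 200, 200, 250, 300, 800, 1000])

def linear_quantile(values, q):
    """Hypothetical helper: quantile by linear interpolation."""
    ordered = np.sort(values)
    pos = (len(ordered) - 1) * q              # fractional index into sorted data
    lower = int(np.floor(pos))                # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)  # index of the value just above
    frac = pos - lower                        # how far to move towards `upper`
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

q1 = linear_quantile(data, 0.25)   # index 2.75: 80 + 0.75 * (100 - 80)
med = linear_quantile(data, 0.50)  # index 5.5:  150 + 0.5 * (200 - 150)
q3 = linear_quantile(data, 0.75)   # index 8.25: 250 + 0.25 * (300 - 250)
print(q1, med, q3)                 # 95.0 175.0 262.5
```

With 12 values the fractional index is 11 * q, which is why the quartiles fall between data points rather than on them.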

User

#2 · edited 2024-10-01 17:30:01

This is both simple and efficient, with no Python for loop to slow things down:

import pandas as pd

s = pd.Series([30, 31, 32, 45, 50, 999])  # example data

s.where(s.between(*s.quantile([0.25, 0.75])), s.median())

It gives you:

0    38.5
1    38.5
2    32.0
3    45.0
4    38.5
5    38.5

Unpacking the code, s.quantile([0.25, 0.75]) gives:

0.25    31.25
0.75    48.75

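We then pass those two values (31.25 and 48.75) to between(), using the * operator to unpack them, because between() expects two separate arguments rather than an array of length 2. That gives us: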

0    False
1    False
2     True
3     True
4    False
5    False

Now that we have the boolean mask, s.where() keeps the original values where the mask is True and falls back to s.median() everywhere else.
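The same loop-free idea can be carried over to the grouped data in the question. The sketch below keeps the question's 3 * IQR fences and its 'ID' and 'Value' column names. The sample data is made up. The "outlier" and "cleaned" column names and the choice to replace each outlier with the mean of its group's non-outlier values are assumptions, since the question does not say which mean is meant.

```python
import pandas as pd

# Made-up sample data; 'ID' and 'Value' are the column names from the question.
df = pd.DataFrame({
    "ID": ["a"] * 6 + ["b"] * 6,
    "Value": [30, 31, 32, 45, 50, 999, 1, 2, 3, 4, 5, -500],
})

values = df.groupby("ID")["Value"]

# Quartiles of each row's own group, broadcast back onto df's index.
q1 = values.transform(lambda v: v.quantile(0.25))
q3 = values.transform(lambda v: v.quantile(0.75))
iqr = q3 - q1

# Same 3 * IQR fences as the question's is_outlier().
df["outlier"] = (df["Value"] > q3 + 3 * iqr) | (df["Value"] < q1 - 3 * iqr)

# Assumption: the replacement is the mean of the group's non-outlier values.
# mask() blanks the outliers to NaN, and "mean" skips NaN by default.
group_mean = df["Value"].mask(df["outlier"]).groupby(df["ID"]).transform("mean")

# Keep the original value unless the row was flagged as an outlier.
df["cleaned"] = df["Value"].where(~df["outlier"], group_mean)
print(df)
```

Using transform() rather than a row-wise apply() means the per-group statistics are computed once per group and aligned to the original rows, so the comparison and the replacement are both plain vectorized column operations.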
