Infinite while loop in Python when calculating the standard deviation

Posted 2024-09-27 23:26:12

We are trying to remove outliers, but we end up in an infinite loop.

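For a school project, we (a friend and I) decided to build a data-science-based tool. To do so, we started cleaning a database (I won't import it here because it is too large (xlsx file, csv file)). We are now trying to remove outliers from the 'duration_minutes' column using the "standard deviation * 3 + mean" rule.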

Here is the code we use to calculate the standard deviation and the mean:

def calculateSD(database, column):
    # Select the column as a Series so that std() returns a single number
    SD = database[column].std(ddof=1)  # ddof=1 gives the sample standard deviation
    return SD

def calculateMean(database, column):
    # Select the column as a Series so that mean() returns a single number
    mean = database[column].mean()
    return mean

This is what we want to do:

#Now we have to remove the outliers using the code from the SD.py and SDfunction.py files
minutes = trainsData['duration_minutes'].tolist() #takes the column duration_minutes and puts it in a list
SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
mean = int(calculateMean(trainsData, 'duration_minutes'))
SDhigh = mean+3*SD

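The code above computes the starting values. We then start a while loop to remove the outliers. After removing them, we recalculate the standard deviation, the mean, and SDhigh. This is the while loop: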

while np.any(i >= SDhigh for i in minutes): #used to be >=, it doesnt matter for the outcome
    trainsData = trainsData[trainsData['duration_minutes'] < SDhigh] #used to be >=, this caused an infinite loop so I changed it to <=. Then to <
    minutes = trainsData['duration_minutes'].tolist()
    SD = int(calculateSD(trainsData, 'duration_minutes')) #calculates the SD of the column
    mean = int(calculateMean(trainsData, 'duration_minutes'))
    SDhigh = mean+3*SD
    print(SDhigh) #to see how the values changed and to confirm it is an infinite loop

The output looks like this:

611
652
428
354
322
308
300
296
296
296
296

It keeps printing 296, and after several hours of trying we have concluded that we are not as smart as we had hoped.


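TL;DR: We are trying to remove all values above standard deviation * 3 + mean until none are left (we recalculate each time to check whether any outliers remain). However, we get an infinite loop.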


1 answer
User
#1 · Posted 2024-09-27 23:26:12

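You are making this harder than it needs to be. Computing the standard deviation to remove outliers, then recomputing it, and so on, is overly complicated (and statistically unsound). You would be better off using percentiles instead of the standard deviation.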

import numpy as np
import pandas as pd

# create data
nums = np.random.normal(50, 8, 200)
df = pd.DataFrame(nums, columns=['duration'])

# set threshold based on percentiles
threshold = df['duration'].quantile(.95) * 2

# now only keep rows that are below the threshold
df = df[df['duration']<threshold]
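
As for why the original loop never stops: np.any is given a generator expression rather than an array. NumPy does not iterate over a generator there, and the result is truthy no matter what the values are, so the while condition stays true even after every value at or above SDhigh has been removed (which would explain why 296 keeps printing). Below is a minimal sketch of a version that terminates, assuming a vectorized check with Series.any(). The function name remove_high_outliers, the max_iter guard, and the sample data are illustrative and do not come from the question.

```python
import numpy as np
import pandas as pd


def remove_high_outliers(df, column, n_sd=3, max_iter=100):
    """Repeatedly drop rows above mean + n_sd * SD until none remain.

    Hypothetical helper; max_iter is a safety guard against endless looping.
    """
    for _ in range(max_iter):
        threshold = df[column].mean() + n_sd * df[column].std(ddof=1)
        too_high = df[column] > threshold  # boolean Series, one entry per row
        if not too_high.any():             # vectorized check, no generator involved
            break
        df = df[~too_high]
    return df


# A generator object is truthy regardless of what it would yield.
gen = (i >= 10 for i in [1, 2, 3])
print(bool(gen))                        # True
print(any(i >= 10 for i in [1, 2, 3]))  # False: built-in any() consumes the generator

# Made-up data: 200 typical durations plus two extreme values.
rng = np.random.default_rng(0)
durations = np.append(rng.normal(50, 8, 200), [500.0, 900.0])
trains = pd.DataFrame({'duration_minutes': durations})
cleaned = remove_high_outliers(trains, 'duration_minutes')
print(len(trains), len(cleaned))
```

Replacing np.any with the built-in any() in the original while condition should also let the loop exit, since any() actually evaluates the generator. The concern above about repeatedly trimming and recomputing still applies either way.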
