Replacing missing values in a pandas DataFrame with the median/mean for continuous variables and the mode for categorical variables, grouped by a column

import collections

import numpy as np
import pandas as pd
from scipy import stats


def fillMissing(df, dataType):
    '''
    Args:
        df (2d array / dict):
            e.g. {'attribute1': [12, 24, 25], 'attribute2': ['good', 'bad']}
        dataType (dict):
            Dictionary with the attribute names of df as keys and 0/1 values
            indicating a categorical/continuous variable,
            e.g. {'attribute1': 1, 'attribute2': 0}

    Returns:
        dataframe with missing values filled;
        also writes a file with the missing values replaced.
    '''
    dataLabels = list(df.columns.values)

    # the dictionary to hold the values to put in place of nan
    replaceValues = {}

    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it's a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get mean (if group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the series is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                # get the most frequent value of the attribute (grouped by class)
                mostFreq = freqC.most_common(1)
                # newGroup = group.replace(np.nan, mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)

    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns:
        mask = mask | df[col].isnull()

    # get the dataframe of tuples that contains nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]

    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")

    # here add newdf to dfnotNulls to get finaldf
    return finaldf

1 answer

User

#1 · Posted 2024-09-30 14:21:59

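If I'm not mistaken, most of this is covered in the documentation, though perhaps not in the place you would look if you are asking this question. See the note at the bottom about mode, since it is slightly more complicated than mean and median.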

import numpy as np
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 2, np.nan, 3, 4, 4, np.nan]},
                  index=[1, 1, 1, 1, 2, 2, 2, 2])

# group on the index (level 0) and fill each group's NaNs with that group's statistic
df['v_mean'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mean()))
df['v_med'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.median()))
df['v_mode'] = df.groupby(level=0)['v'].transform(lambda x: x.fillna(x.mode()[0]))

df
    v    v_mean  v_med  v_mode
1   1  1.000000      1       1
1   2  2.000000      2       2
1   2  2.000000      2       2
1 NaN  1.666667      2       2
2   3  3.000000      3       3
2   4  4.000000      4       4
2   4  4.000000      4       4
2 NaN  3.666667      4       4
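
The same pattern should carry over to the question's setup, where the group key is a column (`class`) rather than the index and the fill statistic depends on the variable type. The following is only a minimal sketch: the toy data and the column names `age` and `job` are made up for illustration, the `data_type` dictionary follows the question's 0/1 convention, and the median is always used for continuous columns (the question's normality test for choosing between mean and median is left out).

```python
import numpy as np
import pandas as pd

# Toy data; 'age' and 'job' are illustrative names, not from the question's dataset.
df = pd.DataFrame({
    'class': ['>50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K'],
    'age': [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
    'job': ['tech', 'tech', None, 'sales', None, 'sales'],
})

# 1 = continuous, 0 = categorical, following the question's dataType convention.
data_type = {'age': 1, 'job': 0}

filled = df.copy()
for col, kind in data_type.items():
    grouped = df.groupby('class')[col]
    if kind == 1:
        # continuous: fill with the per-class median
        filled[col] = grouped.transform(lambda s: s.fillna(s.median()))
    else:
        # categorical: fill with the per-class mode (first one if there are ties)
        filled[col] = grouped.transform(lambda s: s.fillna(s.mode().iloc[0]))

print(filled)
```

Because `transform` returns a result aligned with the original index, no manual loop over rows or reassembly of the null and non-null parts is needed.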

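Note that mode() may not be unique, unlike mean and median, and pandas returns it as a Series for that reason. To deal with that, I just took the simplest approach and added [0] to extract the first member of that Series.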
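
To make the mode caveat concrete, here is a small sketch of what a tie looks like, plus one possible way to guard against a group that has no non-null values at all (in which case `mode()` returns an empty Series and indexing its first element fails). The helper name `fill_with_mode` is hypothetical, and picking the first (smallest) mode on a tie is just one arbitrary choice.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 2, 2, np.nan])
modes = s.mode()             # a tie: both 1.0 and 2.0 are returned, in sorted order
first_mode = modes.iloc[0]   # taking the first element picks the smallest mode


def fill_with_mode(group):
    """Fill NaNs with the group's first mode; leave an all-NaN group unchanged."""
    group_modes = group.mode()
    if group_modes.empty:    # no non-null values, so there is nothing to fill with
        return group
    return group.fillna(group_modes.iloc[0])


df = pd.DataFrame({'v': [1, 1, 2, 2, np.nan, np.nan, np.nan]},
                  index=[1, 1, 1, 1, 1, 2, 2])
df['v_mode'] = df.groupby(level=0)['v'].transform(fill_with_mode)
print(df)
```

Group 1 has a tie between 1 and 2 and is filled with 1; group 2 is entirely missing, so its values stay NaN rather than raising an error.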
