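I have a pandas DataFrame in which all missing values are np.nan, and I am now trying to replace those missing values. The last column of my data is "class". I need to group the data by class, compute the mean/median/mode of each column within each group (depending on whether the data is categorical or continuous, and normally distributed or not), and replace that group's missing values with the corresponding mean/median/mode.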
Here is the code I came up with. I know it is overkill, and it would be great if there were a simpler way to do this.
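For now, I compute the group-wise replacement values (mean/median/mode) and store them in a dict, then separate the rows that contain NaN from the rows that do not, replace the missing values in the NaN rows, and try to join them back into the DataFrame (which I do not yet know how to do).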
import collections
import pandas as pd
from scipy import stats

def fillMissing(df, dataType):
    '''
    Args:
        df ( 2d array/ Dict):
            eg : ('attribute1': [12, 24, 25] , 'attribute2': ['good', 'bad'])
        dataType (dict): Dictionary of attribute names of df as keys and values 0/1
            indicating categorical/continuous variable eg: ('attribute1':1, 'attribute2': 0)
    Returns:
        dataframe with missing values filled
        writes a file with missing values replaced.
    '''
    dataLabels = list(df.columns.values)
    # the dictionary to hold the values to put in place of nan
    replaceValues = {}
    for eachlabel in dataLabels:
        thisSer = df[eachlabel]
        if dataType[eachlabel] == 1:  # if it is a continuous variable
            _, pval = stats.normaltest(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                if pval < 0.5:
                    groupMiddle = group.median()  # get the median of the group
                else:
                    groupMiddle = group.mean()  # get mean (if group is normal)
                innerDict[name.strip()] = groupMiddle
            replaceValues[eachlabel] = innerDict
        else:  # if the series is categorical
            # freqCount = collections.Counter(thisSer)
            groupedd = thisSer.groupby(df['class'])
            innerDict = {}
            for name, group in groupedd:
                freqC = collections.Counter(group)
                mostFreq = freqC.most_common(1)  # get the most frequent value of the attribute (grouped by class)
                # newGroup = group.replace(np.nan , mostFreq)
                innerDict[name.strip()] = mostFreq[0][0].strip()
            replaceValues[eachlabel] = innerDict
    print(replaceValues)
    # replace the missing values =======================
    newfile = open('missingReplaced.csv', 'w')
    newdf = df
    mask = False
    for col in df.columns: mask = mask | df[col].isnull()
    # get the dataframe of tuples that contains nulls
    dfnulls = df[mask]
    dfnotNulls = df[~mask]
    for _, row in dfnulls.iterrows():
        for colname in dataLabels:
            if pd.isnull(row[colname]):
                if row['class'].strip() == '>50K':
                    row[colname] = replaceValues[colname]['>50K']
                else:
                    row[colname] = replaceValues[colname]['<=50K']
            newfile.write(str(row[colname]) + ",")
        newdf.append(row)
        newfile.write("\n")
    # here add newdf to dfnotNulls to get finaldf
    return finaldf
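If I am not mistaken, most of this is covered in the documentation, though it may not be where you would look if you are asking this question. See the note at the bottom about mode, since it is slightly trickier than mean and median.
Note that mode() may not be unique, unlike mean and median, and pandas returns it as a Series. To deal with that, I took the simplest route and added [0] to extract the first member of that Series.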
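The group-wise fill described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer's original snippet: the toy DataFrame, its column names ("age", "job") and the helper fill_with_mode are invented for the example. It relies on groupby(...).transform together with fillna, and on taking the first entry of mode().

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are invented for illustration only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 50.0, np.nan, 70.0],
    "job": ["a", "a", None, "b", "b", None],
    "class": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

def fill_with_mode(group):
    # Hypothetical helper: mode() returns a Series because ties are possible,
    # so take its first entry (.iloc[0] is the positional form of [0]).
    modes = group.mode()
    if modes.empty:  # the group holds only missing values; nothing to fill with
        return group
    return group.fillna(modes.iloc[0])

filled = df.copy()
# Continuous column: fill each class's gaps with that class's mean.
# Swap s.mean() for s.median() to use the median instead.
filled["age"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.mean()))
# Categorical column: fill each class's gaps with that class's most frequent value.
filled["job"] = df.groupby("class")["job"].transform(fill_with_mode)

print(filled)
```

Because transform returns a result aligned with the original index, the filled columns can be assigned straight back, so there is no need to split the rows into NaN and non-NaN sets and join them again afterwards.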