Creating aggregate columns based on a list of column headers


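I have a dataframe containing survey data. Besides several other columns, including demographics (such as age, department, etc.), it contains columns with ratings. I would like to add some columns to the dataframe based on calculations over the rating columns.

The purpose of the added columns is to provide:

a) the count of favourable responses
b) the percentage of favourable responses (number of favourable responses / number of items in that factor)
c) the factor-level percentage of favourable responses (NaN if any item belonging to that factor is NaN)

The table below shows an example of how this would apply to the Coaching factor. I would like to generalize this to other factors such as Diversity, Leadership and Engagement.

<pre><code>Coach_q1    Coach_q2    Coach_q8      coach_favcount  coach_fav_perc  coach_agg_perc
Favourable  Neutral     Favourable    2               66.6%           66.6%
Favourable  Favourable  NaN           2               100%            NaN
Favourable  Favourable  Unfavourable  2               66.6%           66.6%
NaN         NaN         Unfavourable  0               0%              NaN
</code></pre>

I have used the code below and it works, but I only get the fav_count and fav_perc columns, and only for Coaching. I would like to a) get the _agg_perc column and b) apply this to all the other factors.

<pre><code>#Get the Coaching Columns
coaching_agg = df.loc[:, df.columns.str.contains('Coaching_')]
#Create a column to store the number of favourable responses
df['coaching_fav_count'] = df[coaching_agg == 'Favourable'].notna().sum(axis=1)
#create a column to store the percentage of favourable responses
df['coaching_fav_perc'] = df['coaching_fav_count'] / len(coaching_agg.columns)
</code></pre>

I am guessing the logic behind a for loop would be to a) create a list of the rating columns (see the code below), b) create a function that computes the count and the percentage of favourable responses and checks for NaN at the item level, and c) create a for loop that applies the function to the rating columns.

<pre><code>#Create a list made up of rating cols
ratingcollist = ['Coaching_','Communication_','Development_','Diversity_','Engagement_']
ratingcols = df.loc[:, df.columns.str.contains('|'.join(ratingcollist))]
</code></pre>

I would appreciate any help I can get. Thank you.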


1 answer

Anonymous, 1 day ago

I believe you need to process each value in the list separately:

<pre><code>import numpy as np
import pandas as pd

df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'],
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'],
                   'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

#replace nan and NaN strings to missing values
df = df.replace(['nan','NaN'], np.nan)

ratingcollist = ['Coach_','Communication_','Development_','Diversity_','Engagement_']

for rat in ratingcollist:
    #filter columns by substrings
    cols = df.filter(like=rat).columns
    #mask for no missing values
    mask = df[cols].notna().all(axis=1)
    #create new columns if match
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask, f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
</code></pre>
<hr/>
<pre><code>print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667
1  Favourable  Favourable           NaN                2        1.000000
2  Favourable  Favourable  Unfavourable                2        0.666667
3         NaN         NaN  Unfavourable                0        0.000000

   coach_agg_perc
0        0.666667
1             NaN
2        0.666667
3             NaN
</code></pre>
<hr/>
If you replace <code>nan</code> with the word <code>Missing</code> instead, the output for <code>fav_perc</code> is wrong: the second value should be <code>1</code>, but it comes out as <code>0.666667</code>. The reason is that <code>count</code> only excludes real missing values, so the string <code>Missing</code> is counted as an answered item:

<pre><code>df = pd.DataFrame({'Coach_q1': ['Favourable', 'Favourable', 'Favourable', 'nan'],
                   'Coach_q2': ['Neutral', 'Favourable', 'Favourable', 'NaN'],
                   'Coach_q8': ['Favourable', 'nan', 'Unfavourable', 'Unfavourable']})
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable           nan
2  Favourable  Favourable  Unfavourable
3         nan         NaN  Unfavourable

df = df.replace(['nan','NaN'], 'Missing')
print (df)
     Coach_q1    Coach_q2      Coach_q8
0  Favourable     Neutral    Favourable
1  Favourable  Favourable       Missing
2  Favourable  Favourable  Unfavourable
3     Missing     Missing  Unfavourable
</code></pre>
<hr/>
<pre><code>#create a list of all the rating columns
ratingcollist = ['Coach_','Diversity', 'Leadership', 'Engagement']

#create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    mask = (df[cols] != 'Missing').all(axis=1)
    #create 3 new columns for each factor, one for count of Favourable responses,
    #one for percentage of Favourable responses, and one for Factor Level percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].count(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
</code></pre>
<hr/>
<pre><code>print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667
1  Favourable  Favourable       Missing                2        0.666667
2  Favourable  Favourable  Unfavourable                2        0.666667
3     Missing     Missing  Unfavourable                0        0.000000

   coach_agg_perc
0        0.666667
1             NaN
2        0.666667
3             NaN
</code></pre>
So, if you do need to use <code>Missing</code>, replace <code>count</code> with a <code>sum</code> over a not-equal comparison against <code>Missing</code>:

<pre><code>#create a list of all the rating columns
ratingcollist = ['Coach_','Diversity', 'Leadership', 'Engagement']

#create a for loop to get all the columns that match the column list keyword
for rat in ratingcollist:
    cols = df.filter(like=rat).columns
    mask = (df[cols] != 'Missing').all(axis=1)
    #create 3 new columns for each factor, one for count of Favourable responses,
    #one for percentage of Favourable responses, and one for Factor Level percentage of Favourable responses
    if len(cols) > 0:
        df[f'{rat.lower()}fav_count'] = (df[cols] == 'Favourable').sum(axis=1)
        #count only the items that are not the string 'Missing'
        df[f'{rat.lower()}fav_perc'] = df[f'{rat.lower()}fav_count'] / df[cols].ne('Missing').sum(axis=1)
        df.loc[mask,f'{rat.lower()}agg_perc'] = df.loc[mask, f'{rat.lower()}fav_count'] / len(cols)
</code></pre>
<hr/>
<pre><code>print (df)
     Coach_q1    Coach_q2      Coach_q8  coach_fav_count  coach_fav_perc  \
0  Favourable     Neutral    Favourable                2        0.666667
1  Favourable  Favourable       Missing                2        1.000000
2  Favourable  Favourable  Unfavourable                2        0.666667
3     Missing     Missing  Unfavourable                0        0.000000

   coach_agg_perc
0        0.666667
1             NaN
2        0.666667
3             NaN
</code></pre>
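Since the question also asks about wrapping the calculation in a function, here is a minimal sketch of how the loop could be packaged. It is not part of the code above: the helper name `add_factor_aggregates` and its `favourable` parameter are hypothetical, and it assumes that missing responses are real `NaN` values and that every item column starts with its factor prefix.

```python
import numpy as np
import pandas as pd


def add_factor_aggregates(df, prefixes, favourable="Favourable"):
    """Hypothetical helper: return a copy of df with <prefix>fav_count,
    <prefix>fav_perc and <prefix>agg_perc added for each matching prefix."""
    out = df.copy()
    for prefix in prefixes:
        # Item columns are assumed to start with the factor prefix
        cols = [c for c in df.columns if c.startswith(prefix)]
        if not cols:
            continue  # no columns for this factor, nothing to add
        items = df[cols]
        name = prefix.lower()
        fav_count = (items == favourable).sum(axis=1)
        answered = items.notna().sum(axis=1)
        out[f"{name}fav_count"] = fav_count
        # Favourable share among answered items; NaN when nothing was answered
        out[f"{name}fav_perc"] = fav_count / answered.where(answered > 0)
        # Factor-level share, kept only for rows where every item was answered
        complete = items.notna().all(axis=1)
        out[f"{name}agg_perc"] = (fav_count / len(cols)).where(complete)
    return out


df = pd.DataFrame({
    "Coach_q1": ["Favourable", "Favourable", "Favourable", np.nan],
    "Coach_q2": ["Neutral", "Favourable", "Favourable", np.nan],
    "Coach_q8": ["Favourable", np.nan, "Unfavourable", "Unfavourable"],
})
result = add_factor_aggregates(df, ["Coach_", "Diversity_", "Leadership_", "Engagement_"])
print(result.filter(like="coach_"))
```

Working on a copy leaves the original frame untouched, and prefixes with no matching columns are skipped, so the same list of factors can be reused across frames that contain only some of them.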
