Answers to: Comparing column values based on other column values in pandas


Anonymous, 1 day ago

<h2>Note: there is a complete function at the end of this answer, but the code is run piece by piece first so it is easier to follow</h2>
<hr/>
<h2>Getting the gender-ambiguous names</h2>
<p>First, you need the list of gender-ambiguous names, that is, names that occur with both <code>"M"</code> and <code>"F"</code>. I suggest using a set intersection:</p>
<pre><code>>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))
</code></pre>
<p>Next, subset the data so that it only contains gender-ambiguous names from 2014. Use a membership condition (<code>isin</code>) and chain the boolean conditions in a single expression:</p>
<pre><code>>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]
</code></pre>
<hr/>
<h2>Aggregating the data</h2>
<p>You now have the subset stored as <code>gender_ambiguous_data_2014</code>:</p>
<pre><code>>>> gender_ambiguous_data_2014
  sex  year     name  number
0   M  2014     Seth       5
1   M  2014  Spencer       5
3   F  2014     Seth      25
4   F  2014  Spencer      23
</code></pre>
<p>Then simply sum <code>number</code> per name:</p>
<pre><code>>>> gender_ambiguous_data_2014.groupby('name').number.sum()
name
Seth       30
Spencer    28
Name: number, dtype: int64
</code></pre>
<hr/>
<h2>Extracting the name</h2>
<p>The last thing you want is the name with the highest total. In practice, though, several gender-ambiguous names may share the same total, so a single lookup of the maximum is not enough. Assign the previous result to a new variable, <code>gender_ambiguous_numbers_2014</code>, and keep every name whose total equals the maximum:</p>
<pre><code>>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]
</code></pre>
<p>This gives you:</p>
<pre><code>>>> gender_ambiguous_max_2014
name
Seth    30
Name: number, dtype: int64
</code></pre>
<p>Now extract the index, which holds the names:</p>
<pre><code>>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')
</code></pre>
<p>Wait, what type is that? (Hint: it is a pandas <code>Index</code>, reported as <code>pandas.core.index.Index</code> in older pandas versions.)</p>
<p>No problem, just coerce it to a list:</p>
<pre><code>>>> list(gender_ambiguous_max_2014.index)
['Seth']
</code></pre>
<hr/>
<h2>Let's wrap this in a function!</h2>
<p>In this example the list has only one element. But we may want a function that returns a single string when there is one clear winner, and a list of strings when several gender-ambiguous names tie for the highest total in that year.</p>
<p>In the wrapper function below, <code>gender_ambiguous</code> is abbreviated to <code>ga</code> to keep the code short. It assumes the dataset has the same layout as the one you showed and is named <code>df</code>. If yours has a different name, change <code>df</code> accordingly.</p>
<pre><code>def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns: a string, or a list of strings
    Note: 'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[ga_total_numbers == ga_max_number]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if list only contains one element, return the only element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names
</code></pre>
<p>Calling the function is then straightforward:</p>
<pre><code>>>> get_most_popular_gender_ambiguous_name(2014)  # assuming the DataFrame variable is named df
'Seth'
</code></pre>
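<p>To try the steps above end to end, here is a minimal self-contained sketch. It is a variant, not the function above: the name <code>most_popular_ambiguous_names</code> and the <code>frame</code> parameter are hypothetical, the sample rows for <code>Tyler</code> and for 2013 are invented for illustration, ambiguity is detected with <code>groupby(...).nunique()</code> instead of a set intersection, and the function always returns a list so callers do not have to handle two return types.</p>

```python
import pandas as pd

# Hypothetical sample data: the four 2014 rows for Seth and Spencer match the
# output shown in the answer; the Tyler row and the 2013 row are made up.
df = pd.DataFrame({
    "sex":    ["M", "M", "M", "F", "F", "F"],
    "year":   [2014, 2014, 2014, 2014, 2014, 2013],
    "name":   ["Seth", "Spencer", "Tyler", "Seth", "Spencer", "Seth"],
    "number": [5, 5, 3, 25, 23, 40],
})


def most_popular_ambiguous_names(frame, year):
    """Return every gender-ambiguous name tied for the highest total in `year`."""
    # A name counts as ambiguous if it appears with more than one distinct sex
    # anywhere in the data (all years), as in the set-intersection approach.
    sexes_per_name = frame.groupby("name")["sex"].nunique()
    ambiguous = sexes_per_name[sexes_per_name > 1].index

    # Keep only ambiguous names in the requested year.
    subset = frame[frame["name"].isin(ambiguous) & (frame["year"] == year)]

    # Total count per name across both sexes.
    totals = subset.groupby("name")["number"].sum()
    if totals.empty:
        return []

    # Keep all names whose total equals the maximum, so ties are preserved.
    return totals[totals == totals.max()].index.tolist()


print(most_popular_ambiguous_names(df, 2014))
```

<p>Passing the DataFrame as an argument avoids the dependency on a global <code>df</code>, and a year with no matching rows yields an empty list rather than an error.</p>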
