Comparing column values based on other column values in pandas

2 answers

User

#1 · edited 2024-09-26 18:21:09

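Note: there is a complete function at the end of this answer, but to make it easier to follow I run the code piece by piece first.

Getting the gender-ambiguous names

First, you need the list of gender-ambiguous names, that is, names that appear under both sexes. I suggest using a set intersection: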

>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))
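To try the intersection step in isolation, here is a minimal sketch. The sample frame is an assumption: the four Seth/Spencer rows mirror the output shown later in this answer, while the "Tyler" row is a hypothetical male-only name added so the intersection has something to exclude.

```python
import pandas as pd

# Hypothetical sample data; only the Seth/Spencer rows come from the thread.
df = pd.DataFrame(
    {
        "sex": ["M", "M", "M", "F", "F"],
        "year": [2014, 2014, 2014, 2014, 2014],
        "name": ["Seth", "Spencer", "Tyler", "Seth", "Spencer"],
        "number": [5, 5, 3, 25, 23],
    }
)

# Names seen at least once for each sex.
male_names = set(df.loc[df["sex"] == "M", "name"])
female_names = set(df.loc[df["sex"] == "F", "name"])

# Set intersection keeps only names present in both sets;
# sorted() gives a list in a stable order.
gender_ambiguous_names = sorted(male_names & female_names)
print(gender_ambiguous_names)  # ['Seth', 'Spencer']
```

Bracket indexing (`df["name"]`) is used instead of `df.name` only because it works for any column name; both select the same column here.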

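Now you want to subset the data so that it only contains the gender-ambiguous names in 2014. Use a membership condition (isin) and combine it with the year condition in a single boolean expression: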

>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]

Aggregating the data

Call the filtered result gender_ambiguous_data_2014. It looks like this:

>>> gender_ambiguous_data_2014

  sex  year     name  number
0   M  2014     Seth       5
1   M  2014  Spencer       5
3   F  2014     Seth      25
4   F  2014  Spencer      23
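The filtering step can be reproduced as a small standalone sketch. The data is assumed: the "Tyler" row and the 2013 row are hypothetical additions, included so that both the name condition and the year condition actually remove something.

```python
import pandas as pd

# Hypothetical sample data; rows 2 and 5 are invented for illustration.
df = pd.DataFrame(
    {
        "sex": ["M", "M", "M", "F", "F", "F"],
        "year": [2014, 2014, 2014, 2014, 2014, 2013],
        "name": ["Seth", "Spencer", "Tyler", "Seth", "Spencer", "Seth"],
        "number": [5, 5, 3, 25, 23, 10],
    }
)

ga_names = ["Seth", "Spencer"]

# Each condition is a boolean Series; & combines them element-wise.
# The parentheses around the comparison are required because & binds
# more tightly than ==.
mask = df["name"].isin(ga_names) & (df["year"] == 2014)
ga_2014 = df[mask]
print(ga_2014)
```

The original row labels (0, 1, 3, 4) are kept, which is why the index in the output above skips 2.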

Then you simply sum number for each name:

>>> gender_ambiguous_data_2014.groupby('name').number.sum()

name
Seth       30
Spencer    28
Name: number, dtype: int64
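As a self-contained check of the aggregation, the sketch below rebuilds the four filtered rows by hand (an assumption, copied from the output above) and sums them per name.

```python
import pandas as pd

# The four rows of gender_ambiguous_data_2014, re-entered by hand.
ga_2014 = pd.DataFrame(
    {
        "sex": ["M", "M", "F", "F"],
        "year": [2014, 2014, 2014, 2014],
        "name": ["Seth", "Spencer", "Seth", "Spencer"],
        "number": [5, 5, 25, 23],
    },
    index=[0, 1, 3, 4],
)

# groupby("name") collects the male and female rows of each name;
# sum() then adds their counts, giving a Series indexed by name.
totals = ga_2014.groupby("name")["number"].sum()
print(totals)
```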

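Extracting the name

Finally, you want the name with the highest number. In practice, though, several gender-ambiguous names may tie for the highest total, so assign the previous result to a new variable, gender_ambiguous_numbers_2014, and filter it against its own maximum: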

>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]

Now you get this:

>>> gender_ambiguous_max_2014

name
Seth    30
Name: number, dtype: int64
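The reason for comparing against max() rather than asking for a single label is easiest to see with a tie. The totals below are invented for illustration and do not come from the thread.

```python
import pandas as pd

# Hypothetical totals in which two names share the maximum.
totals = pd.Series({"Seth": 30, "Spencer": 30, "Jordan": 12}, name="number")

# idxmax() returns one label only: the first one holding the maximum.
first_only = totals.idxmax()

# A boolean mask against max() keeps every name that ties for the top.
all_max = totals[totals == totals.max()]

print(first_only)           # Seth
print(list(all_max.index))  # ['Seth', 'Spencer']
```

If ties are impossible or irrelevant in your data, idxmax() alone is shorter; the mask is the safer default.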

OK, let's extract the names from the index!

>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')

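Wait, what type is this? (Hint: it is a pandas Index; in the pandas version used here its full path is pandas.core.index.Index.)

No problem, just coerce it to a list: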

>>> list(gender_ambiguous_max_2014.index)
['Seth']

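Let's put this in a function!

In this example the list has only one element. But we may want a function that returns a string when there is a single contender, and a list of strings when several gender-ambiguous names share the same total in that year.

In the wrapper function below I abbreviate gender_ambiguous as ga to keep the code short. This assumes that the dataset has the same format as the one you showed and that it is named df. If it has a different name, change df accordingly.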

def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns:
        a string, or a list of strings

    Note:
        'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[
        ga_total_numbers == ga_max_number
    ]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if list only contains one element, return the only element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names

Calling the function is now simple:

>>> get_most_popular_gender_ambiguous_name(2014)  # assuming df is dataframe var name
'Seth'
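A possible variation, offered as a sketch and not as the answer's own code: pass the frame in as a parameter instead of relying on a global df, and always return a list so callers do not have to handle two return types. The function name most_popular_ga_names and the sample data are assumptions.

```python
import pandas as pd


def most_popular_ga_names(df: pd.DataFrame, year: int) -> list:
    """Return every gender-ambiguous name tied for the top total in `year`."""
    # Names seen under both sexes anywhere in the data (all years),
    # matching the behaviour of the function above.
    male = set(df.loc[df["sex"] == "M", "name"])
    female = set(df.loc[df["sex"] == "F", "name"])
    ga_names = male & female

    in_year = df[df["name"].isin(ga_names) & (df["year"] == year)]
    totals = in_year.groupby("name")["number"].sum()
    if totals.empty:
        # No ambiguous names recorded for that year.
        return []
    return sorted(totals[totals == totals.max()].index)


# Hypothetical sample data; the "Tyler" row is invented.
sample = pd.DataFrame(
    {
        "sex": ["M", "M", "M", "F", "F"],
        "year": [2014, 2014, 2014, 2014, 2014],
        "name": ["Seth", "Spencer", "Tyler", "Seth", "Spencer"],
        "number": [5, 5, 3, 25, 23],
    }
)

print(most_popular_ga_names(sample, 2014))  # ['Seth']
print(most_popular_ga_names(sample, 1999))  # []
```

Returning a list in every case costs the caller one `[0]` when a single name is expected, but avoids type checks on the result.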

User

#2 · edited 2024-09-26 18:21:09

I'm not sure what you mean by "most gender-ambiguous", but you can start from here:

>>> dfy = (df.year == 2014)
>>> dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
>>> dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
>>> pd.merge(dfF, dfM, on=['name'])
      name  number_x  number_y
0     Seth        25         5
1  Spencer        23         5
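The merge above can be run end to end with the sketch below. The sample frame is assumed (the "Tyler" row is hypothetical), and the suffixes argument is an optional addition that replaces the default number_x / number_y column names with more descriptive ones.

```python
import pandas as pd

# Hypothetical sample data; the "Tyler" row is invented.
df = pd.DataFrame(
    {
        "sex": ["M", "M", "M", "F", "F"],
        "year": [2014, 2014, 2014, 2014, 2014],
        "name": ["Seth", "Spencer", "Tyler", "Seth", "Spencer"],
        "number": [5, 5, 3, 25, 23],
    }
)

in_2014 = df["year"] == 2014
females = df.loc[(df["sex"] == "F") & in_2014, ["name", "number"]]
males = df.loc[(df["sex"] == "M") & in_2014, ["name", "number"]]

# merge() defaults to an inner join, so only names present in BOTH
# frames survive; that is what makes them gender-ambiguous.
both = pd.merge(females, males, on="name", suffixes=("_f", "_m"))
print(both)
```

Because the join is inner, no separate set intersection is needed: a name given to only one sex in 2014 drops out automatically. Note that this tests ambiguity within 2014 only, whereas the first answer tests it across all years.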

If you only want the name with the highest total, then:

[code block missing from the source]
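The snippet for this step is not available, so the following is only one possible way to do it, not the answerer's code. It assumes the merged frame from the previous step, re-entered by hand with the default number_x (female) and number_y (male) columns; the column name total is an invented helper.

```python
import pandas as pd

# Merged frame from the previous step, re-entered by hand.
merged = pd.DataFrame(
    {
        "name": ["Seth", "Spencer"],
        "number_x": [25, 23],  # female counts
        "number_y": [5, 5],    # male counts
    }
)

# Add the two counts, then keep every row that ties for the maximum.
merged["total"] = merged["number_x"] + merged["number_y"]
top = merged[merged["total"] == merged["total"].max()]

print(top["name"].tolist())  # ['Seth']
```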
