How to generate all possible groupings for a column in Python

Posted 2024-10-02 20:39:20


I have a CSV dataset that looks like this:

Access_name,AppName,identityName 
AC1,AP1,ID1 
AC1,AP1,ID2 
AC2,AP1,ID1
AC2,AP1,ID2
AC2,AP1,ID3
AC3,AP2,ID2
AC3,AP2,ID3
AC4,AP1,ID1

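For every combination of accesses, I want to find the identities that are assigned to all of the accesses in that combination.

For example: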

AC1 - assigned to ID1, ID2 
AC2 - assigned to ID1, ID2 
AC3 - assigned to ID2, ID3 
AC4 - assigned to ID1 
AC1 and AC2 - assigned to ID1 and ID2
AC1 and AC3 - assigned to ID2
AC1 and AC4 - assigned to None 
AC1 and AC2 and AC3 - assigned to ID2 
AC1 and AC2 and AC4 - assigned to ID1 
AC1 and AC3 and AC4 - assigned to None 
AC2 and AC3 and AC4 - assigned to None

And so on for every possible combination. What is the best way to get this data efficiently? Any code sample would be greatly appreciated.


1 answer
User
#1 · Posted 2024-10-02 20:39:20

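You can do this by defining a function. Either iterate over all combinations of the values in Access_name with itertools, or build the combinations recursively and cut off a branch as soon as no identity holds every access in it. The second approach is faster (especially if your dataset is large, with many distinct Access_name values), but it does not produce the None rows, because combinations that share no identity are never generated.

The second approach looks like this: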

def get_multi_assignments_sub(df, acc_names_selected=None, acc_names_to_observe=None):
    if acc_names_to_observe is None:
        # first call: collect every distinct access name from the
        # per-identity sets in the Access_name column, in sorted order
        acc_names_to_observe= sorted(set().union(*df['Access_name']))
    if acc_names_selected is None:
        acc_names_selected= list()
    # store the partial results in a list to avoid
    # unnecessary calls to pd.concat
    result_dfs= list()
    while len(acc_names_to_observe) > 0:
        # there are still values left to observe
        # the next line makes sure we observe each combination
        # only once (in an ordered fashion)
        add_selection, *acc_names_to_observe= acc_names_to_observe
        work_acc_names_selected= list(acc_names_selected)
        work_acc_names_selected.append(add_selection)
        df_sub= df.loc[df['Access_name'].map(lambda col_val: add_selection in col_val), :]
        if df_sub.shape[0] > 0:
            # the sub dataframe is non-empty, which means
            # some rows still contain all of the values in
            # work_acc_names_selected
            # add these rows to the result, labelled with the
            # combination they were selected for (work_acc_names_selected)
            df_insert= df_sub.copy()
            df_insert['Access_name']= [tuple(work_acc_names_selected)] * df_insert.shape[0]
            result_dfs.append(df_insert)
            if len(acc_names_to_observe) > 0:
                # we still have some values to observe, so
                # dive deeper, adding also this to the
                # result
                result_dfs.extend(get_multi_assignments_sub(
                        df_sub.copy(), 
                        acc_names_selected=   work_acc_names_selected,
                        acc_names_to_observe= acc_names_to_observe
                    )
                )
    return result_dfs

def get_multi_assignments(df):
    # this is just a convenience function
    # that just calls the function above
    # and concats the result after everything
    # is finished
    dfs= get_multi_assignments_sub(df)
    return pd.concat(dfs, axis='index')

To apply it, first pre-aggregate the dataframe like this:

# the AppName is not needed
# after removing it, generate one line per identityName
# with all Access_name values that have this identityName
df_work= df.drop(['AppName'], axis='columns').drop_duplicates().groupby('identityName').agg({'Access_name': set}).reset_index()

Then get the result with:

df_ma= get_multi_assignments(df_work)
df_result= df_ma.groupby('Access_name').agg({'identityName': set}).reset_index()
df_result['identityName']= df_result['identityName'].map(tuple)
df_result.apply(lambda row: '{0} - assigned to {1}'.format(' and '.join(row['Access_name']), ' and '.join(row['identityName'])), axis='columns')

The output looks like this:

0             AC1 - assigned to ID1 and ID2
1     AC1 and AC2 - assigned to ID1 and ID2
2     AC1 and AC2 and AC3 - assigned to ID2
3     AC1 and AC2 and AC4 - assigned to ID1
4             AC1 and AC3 - assigned to ID2
5             AC1 and AC4 - assigned to ID1
6     AC2 - assigned to ID1 and ID2 and ID3
7     AC2 and AC3 - assigned to ID2 and ID3
8             AC2 and AC4 - assigned to ID1
9             AC3 - assigned to ID2 and ID3
10                    AC4 - assigned to ID1

This was produced with the following input data (run it first, since the code above needs pd and df):

import pandas as pd
import io

raw="""
Access_name,AppName,identityName
AC1,AP1,ID1
AC1,AP1,ID2
AC2,AP1,ID1
AC2,AP1,ID2
AC2,AP1,ID3
AC3,AP2,ID2
AC3,AP2,ID3
AC4,AP1,ID1"""

df= pd.read_csv(io.StringIO(raw), sep=',', skiprows=1)
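
The first option mentioned in the answer, iterating over every combination with itertools, is not shown in code there. A minimal sketch of it follows, using only the standard library instead of pandas. The helper names `identities_by_access` and `shared_identities` are made up for illustration, and the sample rows are the ones from the question. Unlike the recursive function, this version also reports combinations that share no identity (the None rows).

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# Sample rows copied from the question.
RAW = """Access_name,AppName,identityName
AC1,AP1,ID1
AC1,AP1,ID2
AC2,AP1,ID1
AC2,AP1,ID2
AC2,AP1,ID3
AC3,AP2,ID2
AC3,AP2,ID3
AC4,AP1,ID1"""


def identities_by_access(csv_text):
    """Map each Access_name to the set of identityName values that hold it."""
    mapping = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        mapping[row['Access_name']].add(row['identityName'])
    return dict(mapping)


def shared_identities(ids_by_access):
    """Return {combination: identities shared by every access in it}."""
    names = sorted(ids_by_access)
    result = {}
    # Enumerate combinations of every size, from single accesses up to all of them.
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # An identity counts only if it holds every access in the combination.
            result[combo] = set.intersection(*(ids_by_access[n] for n in combo))
    return result


shared = shared_identities(identities_by_access(RAW))
for combo, ids in shared.items():
    label = ' and '.join(sorted(ids)) if ids else 'None'
    print('{0} - assigned to {1}'.format(' and '.join(combo), label))
```

With n distinct access names this visits 2^n - 1 combinations, so it is only practical when n is small. That is the trade-off the answer describes: the recursive version skips every extension of a combination that already has no shared identity.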
