Counting occurrences of a pattern in a DataFrame column

   sequence_of_selection
0  Accept,Reject,Reject,Reject,Reject,Accept,Reje...
1  Accept,Reject,Reject,Reject,Reject,Reject,Reje...
2  Reject,Accept,Accept,Reject,Reject,Reject,Acce...
3  Accept,Reject,Accept,Accept,Accept,Accept,Reje...
4  Reject,Accept,Reject,Accept,Reject,Reject,Acce...

2 answers

User

#1 · edited 2024-10-01 05:04:37

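This is a great question. I want to stress how much the spacing between events can tell you: the posted sequences hold a lot of insight into behavior and predictability. With that in mind, I have written a long answer that I hope explains some core principles of data manipulation.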

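1. Create a custom function to perform the calculation:
(This assumes you apply it to a single list; I recommend pulling out one list when debugging or testing.)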

import numpy as np

def event_metrics(my_list, look_for = "Accept", exclude_zeros=True, simple=True):
    """
    Simple mode:
        Returns the average number of `items` before `look_for` 

    Non-Simple mode:
        Returns a dictionary with the mean, median, and max number of `items`
        before `look_for` 

     
    my_list: a list of values

    look_for: An item in the list which constitutes the "event"
              Example: "accept" from a list of "accept" and "reject"

    exclude_zeros: exclude metrics for when `look_for` occurs back to back

    simple: operate in simple mode or non-simple mode

    """

    # Instantiate a counter list
    my_counter = []
    n = 0

    # Loop through the list
    for x in my_list:

        # If a match, add n to the list and reset
        if x==look_for:
            my_counter.append(n)
            n=0

        # Otherwise, continue
        else:
            n+=1

    # Sometimes you might want to append the final n at conclusion of the loop
    # You could do that with the following code:
    # if x!=look_for:
    #     my_counter.append(n)


    # You may not want to include back-to-back events
    if exclude_zeros:
        my_counter = [x for x in my_counter if x>0]

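    # With no qualifying events, np.max raises and np.mean warns, so return NaN
    if not my_counter:
        return np.nan if simple else dict.fromkeys(("mean", "median", "max"), np.nan)

    # You can return a specific metric such as mean
    if simple:
        return np.mean(my_counter)

    # Or you can pass several metrics as a dictionary and convert to a series
    my_metrics = {
        "mean": np.mean(my_counter),
        "median": np.median(my_counter),
        "max": np.max(my_counter),
    }
    return my_metrics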
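To make the counting step concrete, here is a minimal sketch run on a single short list, as suggested for debugging. The helper name `gaps_before` and the sample list are made up for illustration; the loop mirrors the counter logic above but is not the answer's function itself.

```python
def gaps_before(items, look_for="Accept"):
    """Count how many items precede each occurrence of look_for (hypothetical helper)."""
    gaps, n = [], 0
    for item in items:
        if item == look_for:
            gaps.append(n)  # record the run length, then start a new run
            n = 0
        else:
            n += 1
    return gaps  # a trailing run with no closing event is not recorded


# Illustrative data, not taken from the question's DataFrame
sample = ["Accept", "Reject", "Reject", "Reject", "Reject", "Accept", "Reject"]

gaps = gaps_before(sample)
nonzero_gaps = [g for g in gaps if g > 0]  # what exclude_zeros=True would keep

print(gaps)          # [0, 4]
print(nonzero_gaps)  # [4]
```

The leading "Accept" contributes a gap of 0, which is why excluding zeros changes the mean from 2.0 to 4.0 for this list.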

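2. Apply this custom function to the df:

Simple mode: returns an array of single values; treat it as a new column.
Non-simple mode: returns an array of dictionaries; use .apply(pd.Series) to convert it into multiple columns, then use df.merge to add them to the original df.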

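import pandas as pd

# Simple Mode
df["sequence_of_selection"].apply(event_metrics, simple=True)

# Non-Simple Mode
# Parentheses replace the backslashes: a comment after "\" is a syntax error
temp_df = (
    df["sequence_of_selection"]
    .apply(event_metrics, simple=False)
    .apply(pd.Series)    # Convert to its own df
    .add_prefix("rej_")  # Add a prefix to your column names
)

# merge returns a new DataFrame; assign it if you want to keep the result
df.merge(temp_df, left_index=True, right_index=True)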
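The non-simple flow can be sketched end to end on a toy DataFrame. Everything here is an assumption for illustration: the two sample rows, the helper `gap_stats` (a trimmed stand-in for the function above), and the choice to build the stats frame with `pd.DataFrame(...)` and `df.join` rather than `.apply(pd.Series)` and `merge`. It also assumes each cell holds a Python list.

```python
import numpy as np
import pandas as pd


def gap_stats(items, look_for="Accept"):
    """Return mean and max of the non-zero gaps before look_for (hypothetical helper)."""
    gaps, n = [], 0
    for item in items:
        if item == look_for:
            gaps.append(n)
            n = 0
        else:
            n += 1
    gaps = [g for g in gaps if g > 0]
    if not gaps:  # no qualifying events in this row
        return {"mean": np.nan, "max": np.nan}
    return {"mean": float(np.mean(gaps)), "max": int(np.max(gaps))}


# Illustrative rows; assumes the column holds lists, not strings
df = pd.DataFrame({
    "sequence_of_selection": [
        ["Accept", "Reject", "Reject", "Accept"],
        ["Reject", "Accept", "Reject", "Reject", "Reject", "Accept"],
    ]
})

# One dict per row becomes one DataFrame row; keep the index so rows line up
stats = pd.DataFrame(
    df["sequence_of_selection"].map(gap_stats).tolist(),
    index=df.index,
).add_prefix("rej_")

result = df.join(stats)
print(result[["rej_mean", "rej_max"]])
```

Building the frame from a list of dicts avoids constructing a Series per row, which tends to matter on larger frames; for a handful of rows either approach works.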

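User

#2 · edited 2024-10-01 05:04:37

Because they are lists, you can get the index of 'Accept' in each one and then take the mean of those indices. If the index is 0, the first item in the list is 'Accept', so there are zero 'Reject' entries before it, and so on. Note that list.index only finds the first 'Accept' in each list.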

df['sequence_of_selection'].apply(lambda x: x.index('Accept')).mean()
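The one-liner above raises `ValueError` for any row without 'Accept', and on a string it would return a character offset rather than an item position. A defensive sketch follows, under the assumption that cells may be either lists or comma-separated strings (the question's printout does not make this clear). The helper name `first_accept_position` and the sample Series are made up for illustration.

```python
import pandas as pd


def first_accept_position(seq, look_for="Accept"):
    """Position of the first look_for, or NaN if it never occurs (hypothetical helper)."""
    # Assumption: a string cell is a comma-separated sequence of labels
    items = seq.split(",") if isinstance(seq, str) else list(seq)
    return items.index(look_for) if look_for in items else float("nan")


# Illustrative data mixing both cell types
s = pd.Series([
    "Accept,Reject",
    ["Reject", "Reject", "Accept"],
    "Reject,Reject",  # no 'Accept' at all
])

positions = s.apply(first_accept_position)
mean_position = positions.mean()  # NaN rows are skipped by default

print(positions.tolist())
print(mean_position)
```

Rows with no 'Accept' come back as NaN, so they drop out of the mean instead of stopping the computation.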
