Counting occurrence patterns in a DataFrame column

Posted 2024-10-01 05:04:37


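I have a DataFrame with a column in which each row contains a list. I would like to know the most efficient way / best practice for identifying patterns in that list column, for example the average number of rejections before an acceptance. (See the example below.)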

    sequence_of_selection
0   Accept,Reject,Reject,Reject,Reject,Accept,Reje...
1   Accept,Reject,Reject,Reject,Reject,Reject,Reje...
2   Reject,Accept,Accept,Reject,Reject,Reject,Acce...
3   Accept,Reject,Accept,Accept,Accept,Accept,Reje...
4   Reject,Accept,Reject,Accept,Reject,Reject,Acce...

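I could convert the data to strings and split them, or search the strings for substrings, and so on, but I would prefer a more efficient approach, since Python strings are immutable.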

Any suggestions or help would be greatly appreciated.



2 Answers

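This is a very good question. I want to emphasize the power of the time between events: based on the sequence you posted, there is a lot of insight to be gained into behavior and predictability. With that in mind, I wrote a fairly long answer that I hope explains some core principles of data manipulation.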

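1. Create a custom function to perform the calculation:
(This assumes you are applying it to a single list. I recommend pulling out one list when debugging or testing.)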

import numpy as np

def event_metrics(my_list, look_for = "Accept", exclude_zeros=True, simple=True):
    """
    Simple mode:
        Returns the average number of `items` before `look_for` 

    Non-Simple mode:
        Returns a dictionary with the mean, median, and max number of `items`
        before `look_for` 

     
    my_list: a list of values

    look_for: An item in the list which constitutes the "event"
              Example: "accept" from a list of "accept" and "reject"

    exclude_zeros: exclude metrics for when `look_for` occurs back to back

    simple: operate in simple mode or non-simple mode

    """

    # Instantiate a counter list
    my_counter = []
    n = 0

    # Loop through the list
    for x in my_list:

        # If a match, add n to the list and reset
        if x==look_for:
            my_counter.append(n)
            n=0

        # Otherwise, continue
        else:
            n+=1

    # Sometimes you might want to append the final n at conclusion of the loop
    # You could do that with the following code:
    # if x!=look_for:
    #     my_counter.append(n)


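    # You may not want to include back-to-back events
    if exclude_zeros:
        my_counter = [x for x in my_counter if x>0]

    # With no qualifying events, np.max would raise on an empty list,
    # so return NaN instead
    if not my_counter:
        return np.nan if simple else {"mean": np.nan, "median": np.nan, "max": np.nan}

    # You can return a specific metric such as mean
    if simple:
        return np.mean(my_counter)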

    # Or you can pass several metrics as a dictionary and convert to a series
    my_metrics = {
        "mean":np.mean(my_counter),
        "median":np.median(my_counter),
        "max":np.max(my_counter)
    }
    return my_metrics
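
To make the counting step concrete, here is a minimal, self-contained sketch of the same loop on one hand-made list. The helper name `gaps_before` and the sample data are made up for illustration and are not part of the answer above; the helper only returns the raw counts that `event_metrics` collects in `my_counter`.

```python
def gaps_before(items, look_for="Accept"):
    """Return how many non-matching items were seen before each match."""
    gaps, n = [], 0
    for item in items:
        if item == look_for:
            gaps.append(n)  # record the run length, then start counting again
            n = 0
        else:
            n += 1
    return gaps  # a trailing run with no match after it is not recorded


# Hypothetical sample list
sample = ["Accept", "Reject", "Reject", "Accept", "Accept", "Reject"]

gaps = gaps_before(sample)
print(gaps)  # [0, 2, 0]

# Same filtering as exclude_zeros=True: drop back-to-back matches
nonzero = [g for g in gaps if g > 0]
print(sum(nonzero) / len(nonzero))  # 2.0
```

The zeros come from an 'Accept' at the very start of the list or directly after another 'Accept', which is why dropping them changes the mean from about 0.67 to 2.0 here.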

2. Apply the custom function to the df:

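  • Simple mode: returns a Series of single values, which you can treat as a new column.
  • Non-simple mode: returns a Series of dictionaries. Use .apply(pd.Series) to expand it into multiple columns, then use df.merge to add them to the original df.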
# Simple Mode
df["sequence_of_selection"].apply(event_metrics, simple=True)

# Non-Simple Mode
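# Parentheses are used for the multi-line chain because a comment
# after a backslash line continuation is a syntax error
temp_df = (
    df["sequence_of_selection"]
    .apply(event_metrics, simple=False)
    .apply(pd.Series)    # Convert to its own df
    .add_prefix("rej_")  # Add a prefix to your column names
)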

df.merge(temp_df,left_index=True,right_index=True)
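
The preview in the question shows comma-separated text without brackets, so the column may actually hold strings and not lists. Assuming that is the case, the sketch below splits each string into a list first and then adds per-row metrics as new columns. The DataFrame contents, the helper `reject_runs`, and the column names `rej_mean` and `rej_max` are all made up for illustration. Unlike the default `exclude_zeros=True` above, this version keeps zero-length runs.

```python
import pandas as pd

# Hypothetical data: each row is one comma-separated string
df = pd.DataFrame({"sequence_of_selection": [
    "Accept,Reject,Reject,Accept",
    "Reject,Accept,Reject,Reject,Reject,Accept",
]})


def reject_runs(items, look_for="Accept"):
    """Count the items seen before each occurrence of look_for."""
    runs, n = [], 0
    for item in items:
        if item == look_for:
            runs.append(n)
            n = 0
        else:
            n += 1
    return runs


# Split each string into a list, then compute the runs once per row
lists = df["sequence_of_selection"].str.split(",")
runs = lists.apply(reject_runs)

# Rows with no match at all get NaN instead of raising an error
df["rej_mean"] = runs.apply(lambda r: sum(r) / len(r) if r else float("nan"))
df["rej_max"] = runs.apply(lambda r: max(r) if r else float("nan"))

print(df[["rej_mean", "rej_max"]])
```

If the column already holds real lists, the `str.split` step can be skipped and `reject_runs` applied to the column directly.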

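Because they are lists, you can get the index of the first 'Accept' in each one and then take the mean of those indices. If the index is 0, the first item in the list is 'Accept', so there are zero 'Reject's before it, and so on.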

df['sequence_of_selection'].apply(lambda x: x.index('Accept')).mean()
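
One caveat with the one-liner above: `list.index` raises `ValueError` when the list contains no 'Accept' at all. A minimal sketch of one way to guard against that, assuming such rows should simply be left out of the average, is shown below. The helper name `first_index` and the sample Series are made up for illustration.

```python
import pandas as pd


def first_index(items, look_for="Accept"):
    """Position of the first look_for, or NaN if it never occurs."""
    try:
        return items.index(look_for)
    except ValueError:
        return float("nan")


# Hypothetical data: the last row never contains 'Accept'
s = pd.Series([
    ["Accept", "Reject"],
    ["Reject", "Reject", "Accept"],
    ["Reject", "Reject"],
])

first = s.apply(first_index).astype(float)

# Series.mean skips NaN by default, so only the first two rows count
print(first.mean())  # 1.0
```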
