Creating a probability table from event times in Python

Posted 2024-09-02 21:29:29


I have a dataset for a university project, obtained after some preprocessing of the data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
'duplicates': [
     [('007', "us1", "us2", "time1", 'time2', 4)],
     [('008', "us1", "us2", "time1", 'time2', 5)],
     [('009', "us1", "us2", "time1", 'time2', 6)],
     [('007', 'us2', "us3", "time1", 'time2', 4)],
     [('008', 'us2', "us3", "time1", 'time2', 7)], 
     [('009', 'us2', "us3", "time1", 'time2', 11)], 
     [('001', 'us5', 'us1', "time1", 'time2', 0)], 
     [('008', 'us5', 'us1', "time1", 'time2', 19)], 
     [('007',"us3", "us2", "time1", 'time2', 2)],
     [('007',"us3", "us2", "time1", 'time2', 34)],
     [('009',"us3", "us2", "time1", 'time2', 67)]],
'numberOfInteractions': [1, 2, 3, 4, 5, 6, 7, 8, 1, 1, 11]
   })

enter image description here

Each entry in "duplicates" is a tuple: (ID, USER1, USER2, TIME USER1, TIME USER2, DELAY BETWEEN TIMES)

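Now I have to build a user x user probability table, which I did by counting interactions. For the us2 column, for example, we get (1+2+3)/19, NaN/19, (11+1+1)/19. Here 1 + 2 + 3 is the sum of numberOfInteractions between us1 and us2 (df[us1, us2]), i.e. rows 0 to 2 in the first image.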

enter image description here

The code is as follows:

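# Prepend numberOfInteractions to every tuple in the row's list.
df['duplicates'] = df.apply(
    lambda x: [(x['numberOfInteractions'], a, b, c, d, e, f)
               for a, b, c, d, e, f in x.duplicates], axis=1)

# One row per tuple, then sum the interactions for each (USER1, USER2) pair.
df = (pd.DataFrame(df["duplicates"].explode().tolist(),
                   columns=["numberOfInteractions", "ID", "USER1", "USER2", "TAU1", "TAU2", "DELAY"])
      .groupby(["USER1", "USER2"])["numberOfInteractions"]
      .sum().to_frame().unstack())

df.columns = df.columns.get_level_values(1)
# Index.union replaces `df.index | df.columns`, which newer pandas no longer treats as a set union.
combined = df.index.union(df.columns)
for col in combined:
    if col not in df.columns:
        df[col] = np.nan
    # Normalise each column so that its non-NaN entries sum to 1.
    df[col] = df[col] / df[col].sum(skipna=True)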

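The problem is that I want a probability based on the last element of the tuple (the delay between the two times).

For example, 'us5', 'us1' has two interactions, one with delay 19 and one with delay 0 (rows 6 and 7 in the first image), so I would like a tuple of probabilities over delay ranges such as (less than 5, less than 19, less than 60, less than 80, less than 98). In this case df['us5', 'us1'] would be (7/15, 8/15, 0/15, 0/15) instead of today's 1 (my algorithm adds (8+7)/15, which gives 1).

This is my idea, but I don't even know how to start.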


1 answer
User
#1 · Posted 2024-09-02 21:29:29

I think you have two options.

You can build a new column from the delay and the number of interactions (this is what I would do):

# df here is the flat (exploded) frame with one row per tuple and the columns
# USER1, USER2, DELAY and numberOfInteractions, i.e. before the groupby in the question.
def mapToNbOfInteractionsPerDelay(row):
    nbOfInteractions = row['numberOfInteractions']
    delay = row['DELAY']

    # Put the interaction count in the slot of the first delay bound that fits.
    if delay <= 5:
        return (nbOfInteractions, 0, 0, 0, 0)
    elif delay <= 19:
        return (0, nbOfInteractions, 0, 0, 0)
    elif delay <= 60:
        return (0, 0, nbOfInteractions, 0, 0)
    elif delay <= 80:
        return (0, 0, 0, nbOfInteractions, 0)
    else:
        return (0, 0, 0, 0, nbOfInteractions)


df["nbOfInteractionsPerDelay"] = df[["DELAY", "numberOfInteractions"]].apply(mapToNbOfInteractionsPerDelay, axis=1)

Then you can aggregate per user pair, summing the tuples element by element:

df = (df.groupby(["USER1","USER2"])["nbOfInteractionsPerDelay"]
        .agg(lambda l : tuple([sum(x) for x in zip(*l)])).to_frame().unstack())

This gives you:

      nbOfInteractionsPerDelay                                    
USER2                      us1               us2               us3
USER1                                                            
us1                        NaN   (3, 3, 0, 0, 0)               NaN
us2                        NaN               NaN  (4, 11, 0, 0, 0)
us3                        NaN  (1, 0, 1, 11, 0)               NaN
us5            (7, 8, 0, 0, 0)               NaN               NaN

From there, you can easily get what you want.
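One possible way to take that last step is sketched below. It assumes the goal is the one from the question: divide every tuple by the total number of interactions in its USER2 column, so that us5/us1 becomes (7/15, 8/15, 0, 0, 0). The names `table` and `probs` are made up for this illustration, and the tuple table is typed in by hand from the output above rather than recomputed.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The aggregated tuple table, copied by hand from the output above
# (rows are USER1, columns are USER2).
table = pd.DataFrame(
    {
        "us1": [nan, nan, nan, (7, 8, 0, 0, 0)],
        "us2": [(3, 3, 0, 0, 0), nan, (1, 0, 1, 11, 0), nan],
        "us3": [nan, (4, 11, 0, 0, 0), nan, nan],
    },
    index=["us1", "us2", "us3", "us5"],
    dtype=object,
)

probs = table.copy()
for col in table.columns:
    cells = table[col].dropna()
    # Assumption: normalise by the total interactions of the whole column,
    # as the column-wise division in the question does.
    total = sum(sum(t) for t in cells)
    # Rows without a tuple stay NaN because the assignment aligns on the index.
    probs[col] = cells.map(lambda t: tuple(v / total for v in t))

print(probs)
```

With this normalisation, all the tuple entries in a column together sum to 1, which matches the (1+2+3)/19 style of calculation in the question.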

Or split the DataFrame into 5 other DataFrames, each holding the values for one delay range, and then merge them.
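A compact variant of this second option, offered only as a sketch: instead of materialising five separate frames, each row is labelled with its delay range using `pd.cut` and the result is pivoted once, so every output column plays the role of one of the five sub-frames. The frame `flat`, the `bucket` column and the range labels are assumptions made for this example; the data is the flat form of the question's tuples, and the bounds are the ones used in the first option.

```python
import pandas as pd

# Flat form of the question's data: one row per tuple.
flat = pd.DataFrame(
    [
        ("us1", "us2", 4, 1), ("us1", "us2", 5, 2), ("us1", "us2", 6, 3),
        ("us2", "us3", 4, 4), ("us2", "us3", 7, 5), ("us2", "us3", 11, 6),
        ("us5", "us1", 0, 7), ("us5", "us1", 19, 8),
        ("us3", "us2", 2, 1), ("us3", "us2", 34, 1), ("us3", "us2", 67, 11),
    ],
    columns=["USER1", "USER2", "DELAY", "numberOfInteractions"],
)

# Right-closed ranges: (-inf, 5], (5, 19], (19, 60], (60, 80], (80, inf).
labels = ["<=5", "<=19", "<=60", "<=80", ">80"]
bins = [float("-inf"), 5, 19, 60, 80, float("inf")]
flat["bucket"] = pd.cut(flat["DELAY"], bins=bins, labels=labels).astype(str)

# Sum interactions per (USER1, USER2, range), then one column per range.
counts = (
    flat.groupby(["USER1", "USER2", "bucket"])["numberOfInteractions"]
    .sum()
    .unstack("bucket")
    .reindex(columns=labels)  # fixed column order, adds ranges with no rows
    .fillna(0)
)

# Assumption: probabilities are taken relative to all interactions received
# by the same USER2, as in the question's column-wise normalisation.
totals = counts.sum(axis=1).groupby(level="USER2").transform("sum")
probs = counts.div(totals, axis=0)

print(probs)
```

Keeping the ranges as numeric columns rather than tuples makes the later arithmetic simpler, at the cost of a (USER1, USER2) row index instead of a square user x user table.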
