Cartesian product of all items in a given group


I start with a dataframe like this:

       id        tof
0    43.0  1999991.0
1    43.0  2095230.0
2    43.0  4123105.0
3    43.0  5560423.0
4    46.0  2098996.0
5    46.0  2114971.0
6    46.0  4130033.0
7    46.0  4355096.0
8    82.0  2055207.0
9    82.0  2093996.0
10   82.0  4193587.0
11   90.0  2059360.0
12   90.0  2083762.0
13   90.0  2648235.0
14   90.0  4212177.0
15  103.0  1993306.0
          .
          .
          .

Ultimately, my goal is to create a very long 2D array containing the combinations of all items that share the same id (shown here for the rows of one id):

[(1993306.0, 2105441.0), (1993306.0, 3972679.0), (1993306.0, 3992558.0), (1993306.0, 4009044.0), (2105441.0, 3972679.0), (2105441.0, 3992558.0), (2105441.0, 4009044.0), (3972679.0, 3992558.0), (3972679.0, 4009044.0), (3992558.0, 4009044.0),...]

Except with all of the tuples changed to arrays, so that after iterating over all the id numbers I can transpose the array.

Naturally itertools came to mind, and my first thought was to do something with df.groupby('id') so that itertools is applied internally to each group sharing the same id, but I suspect that would take forever on the million-row data file I have.
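A minimal sketch of that groupby-plus-itertools idea (my own illustration of the approach described above, not a tested solution):

from itertools import combinations

# for each id, collect all 2-element combinations of its tof values
pairs_per_id = df.groupby('id')['tof'].apply(lambda s: list(combinations(s, 2)))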

Is there a vectorized way to do this?


3 Answers

IIUC (if I understand correctly):

from itertools import combinations

pd.DataFrame([
    [k, c0, c1] for k, tof in df.groupby('id').tof
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])

      id       tof0       tof1
0   43.0  1999991.0  2095230.0
1   43.0  1999991.0  4123105.0
2   43.0  1999991.0  5560423.0
3   43.0  2095230.0  4123105.0
4   43.0  2095230.0  5560423.0
5   43.0  4123105.0  5560423.0
6   46.0  2098996.0  2114971.0
7   46.0  2098996.0  4130033.0
8   46.0  2098996.0  4355096.0
9   46.0  2114971.0  4130033.0
10  46.0  2114971.0  4355096.0
11  46.0  4130033.0  4355096.0
12  82.0  2055207.0  2093996.0
13  82.0  2055207.0  4193587.0
14  82.0  2093996.0  4193587.0
15  90.0  2059360.0  2083762.0
16  90.0  2059360.0  2648235.0
17  90.0  2059360.0  4212177.0
18  90.0  2083762.0  2648235.0
19  90.0  2083762.0  4212177.0
20  90.0  2648235.0  4212177.0

Explanation

This is a list comprehension that produces a list of lists, which is then wrapped in the DataFrame constructor. Look up comprehensions to understand this pattern better.

from itertools import combinations

pd.DataFrame([
    #            name   series of tof values
    #               ↓   ↓    
    [k, c0, c1] for k, tof in df.groupby('id').tof
    #    items from combinations
    #      first    second
    #          ↓    ↓
           for c0, c1 in combinations(tof, 2)
], columns=['id', 'tof0', 'tof1'])
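If you then need the plain 2D array from the question rather than a DataFrame, a small follow-up sketch (assuming the frame built above is stored in a variable named pairs, which is my own naming):

import numpy as np

arr = pairs[['tof0', 'tof1']].to_numpy()   # shape (n_pairs, 2)
arr_t = arr.T                              # shape (2, n_pairs), the transposed layout the question asks for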

Groupby does work:

def get_product(x):
    # full Cartesian product of the group's tof values with itself
    # (keeps self-pairs and both orderings, unlike combinations)
    return pd.MultiIndex.from_product((x.tof, x.tof)).values

for i, g in df.groupby('id'):
    print(i, get_product(g))

Output:

43.0 [(1999991.0, 1999991.0) (1999991.0, 2095230.0) (1999991.0, 4123105.0)
 (1999991.0, 5560423.0) (2095230.0, 1999991.0) (2095230.0, 2095230.0)
 (2095230.0, 4123105.0) (2095230.0, 5560423.0) (4123105.0, 1999991.0)
 (4123105.0, 2095230.0) (4123105.0, 4123105.0) (4123105.0, 5560423.0)
 (5560423.0, 1999991.0) (5560423.0, 2095230.0) (5560423.0, 4123105.0)
 (5560423.0, 5560423.0)]
46.0 [(2098996.0, 2098996.0) (2098996.0, 2114971.0) (2098996.0, 4130033.0)
 (2098996.0, 4355096.0) (2114971.0, 2098996.0) (2114971.0, 2114971.0)
 (2114971.0, 4130033.0) (2114971.0, 4355096.0) (4130033.0, 2098996.0)
 (4130033.0, 2114971.0) (4130033.0, 4130033.0) (4130033.0, 4355096.0)
 (4355096.0, 2098996.0) (4355096.0, 2114971.0) (4355096.0, 4130033.0)
 (4355096.0, 4355096.0)]
82.0 [(2055207.0, 2055207.0) (2055207.0, 2093996.0) (2055207.0, 4193587.0)
 (2093996.0, 2055207.0) (2093996.0, 2093996.0) (2093996.0, 4193587.0)
 (4193587.0, 2055207.0) (4193587.0, 2093996.0) (4193587.0, 4193587.0)]
90.0 [(2059360.0, 2059360.0) (2059360.0, 2083762.0) (2059360.0, 2648235.0)
 (2059360.0, 4212177.0) (2083762.0, 2059360.0) (2083762.0, 2083762.0)
 (2083762.0, 2648235.0) (2083762.0, 4212177.0) (2648235.0, 2059360.0)
 (2648235.0, 2083762.0) (2648235.0, 2648235.0) (2648235.0, 4212177.0)
 (4212177.0, 2059360.0) (4212177.0, 2083762.0) (4212177.0, 2648235.0)
 (4212177.0, 4212177.0)]
103.0 [(1993306.0, 1993306.0)]
If you want every ordered pair for a single id (self-pairs and both orderings included), you can use

from itertools import product
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(product(x,x))

If you don't want repeated elements, you can use

from itertools import combinations
x = df[df.id == 13].tof.values.astype(float)
all_combinations = list(combinations(x,2))
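To build the single long array over every id that the question describes, a possible follow-up sketch (my own, stacking the per-id combinations and transposing at the end):

import numpy as np
from itertools import combinations

all_pairs = np.array([
    pair
    for _, g in df.groupby('id')                              # loop over every id group
    for pair in combinations(g.tof.values.astype(float), 2)  # 2-element combinations within the group
])                          # shape (n_pairs, 2)
all_pairs_t = all_pairs.T   # shape (2, n_pairs)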
