Grouping two NumPy arrays into a dict of lists

Published 2024-10-04 03:16:22


I have two large NumPy arrays, each of shape (519990,), that look like this:

Order = array([0, 0, 0, 5, 6, 10, 14, 14, 14, 23, 23, 39])
Letters = array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'])

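As you can see, the first array is always sorted in ascending order and never negative. I want to group everything in Letters by its Order value, like this: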

{0:[A,B,C], 5:[D], 6:[E], 10:[F], 14:[G, H, I], 23:[J, K], 39:[L]}

The code I currently use to do this is:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['order'] = Order
df['letters'] = Letters

# Group by 'order' and turn each group into {'letters': [...]}
linearDict = df.groupby('order').apply(lambda dfg: dfg.drop('order', axis=1).to_dict(orient='list')).to_dict()

# Unwrap the inner dicts into one array per key
endProduct = {}
for k, v in linearDict.items():
    endProduct[k] = np.array(v['letters'])

# endProduct is then:
# {0:array([A,B,C]), 5:array([D]), 6:array([E]), 10:array([F]), 14:array([G, H, I]), 23:array([J, K]), 39:array([L])}

My problem is that this process is far too slow. It puts a huge load on the system and crashes my Jupyter notebook. Is there a faster way?


3 answers

Try this:

grp = np.cumsum(np.unique(Order, return_counts=True)[1])
arr = np.stack(np.split(Letters, grp)[:-1])
{n: k for n, k in enumerate(arr.tolist())}

Output:

{0: ['A', 'B', 'C'],
 1: ['D', 'E', 'F'],
 2: ['G', 'H', 'I'],
 3: ['J', 'K', 'L']}
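
Note that np.stack needs every group to have the same length, and enumerate keys the result by position rather than by the Order value. A minimal sketch of a variant that drops the stacking step and keys by Order, run on the question's sample arrays, might look like this. The function name group_with_split is only illustrative, and like the snippet above it assumes Order is already sorted:

```python
import numpy as np

# Sample data from the question
Order = np.array([0, 0, 0, 5, 6, 10, 14, 14, 14, 23, 23, 39])
Letters = np.array(list("ABCDEFGHIJKL"))

def group_with_split(order, letters):
    """Group `letters` by `order`; assumes `order` is sorted ascending."""
    # Unique keys (sorted) and the size of each group
    keys, counts = np.unique(order, return_counts=True)
    # Split positions are the cumulative group sizes, without the final total
    chunks = np.split(letters, np.cumsum(counts)[:-1])
    # Plain Python ints and lists make the result easy to compare and print
    return {int(k): chunk.tolist() for k, chunk in zip(keys, chunks)}

grouped = group_with_split(Order, Letters)
print(grouped)
```

Dropping the [:-1] from the cumulative sum instead of from the split result avoids creating an empty trailing chunk.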

Use:

data = df.groupby('order')['letters'].agg(list).to_dict()

We can improve performance further by passing sort=False to groupby and aggregating with tuple instead of list:

data = df.groupby('order', sort=False)['letters'].agg(tuple).to_dict()

Result:

# print(data)
{0: ['A', 'B', 'C'], 1: ['D', 'E', 'F'], 2: ['G', 'H', 'I'], 3: ['J', 'K', 'L']}
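
For completeness, here is a self-contained sketch of both groupby calls applied to the question's sample arrays. Building the DataFrame from a dict and the variable names as_lists and as_tuples are my own choices, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Sample data from the question
Order = np.array([0, 0, 0, 5, 6, 10, 14, 14, 14, 23, 23, 39])
Letters = np.array(list("ABCDEFGHIJKL"))

df = pd.DataFrame({"order": Order, "letters": Letters})

# One list per distinct 'order' value
as_lists = df.groupby("order")["letters"].agg(list).to_dict()

# Same grouping, but skipping the key sort and collecting tuples
as_tuples = df.groupby("order", sort=False)["letters"].agg(tuple).to_dict()

print(as_lists)
print(as_tuples)
```

With sort=False the keys come out in order of first appearance, which for an already sorted Order is the same as sorted order.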

timeit performance results:

df.shape    
(1200000, 2)

o = np.repeat([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], 100000)
l = np.repeat(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], 100000)

***Fastest answer***
%%timeit -n10 @Divakar
idx = np.flatnonzero(np.r_[True,o[:-1]!=o[1:],True])
{o[i]:l[i:j] for (i,j) in zip(idx[:-1],idx[1:])}
1.44 ms ± 243 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
*******************

%%timeit -n10 @Scott
grp = np.cumsum(np.unique(o, return_counts=True)[1])
arr = np.stack(np.split(l, grp)[:-1])
{n: k for n, k in enumerate(arr.tolist())}
38.5 ms ± 699 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @ shubham 2
data = df.groupby('order', sort=False)['letters'].agg(tuple).to_dict()
118 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @shubham 1
data = df.groupby('order')['letters'].agg(list).to_dict()
177 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @anky 1
d = (dict([*chain(*map(dict.items,[{k:[*zip(*g)][1] } 
     for k,g in groupby(zip(o,l),itemgetter(0))]))]))
636 ms ± 23.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @ anky 2
_ = dict([(k,list(zip(*g))[1]) for k,g in groupby(zip(o,l),itemgetter(0))])
659 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @Ch3ster
new = defaultdict(list)
for k,v in zip(o, l):
    new[k].append(v)
602 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can exploit the fact that Order is sorted: once we have the indices where each group starts and ends, we simply slice Letters, as follows:

def numpy_slice(Order, Letters):
    Order = np.asarray(Order)
    Letters = np.asarray(Letters)
    idx = np.flatnonzero(np.r_[True,Order[:-1]!=Order[1:],True])
    return {Order[i]:Letters[i:j] for (i,j) in zip(idx[:-1],idx[1:])}

Sample run:

In [66]: Order
Out[66]: array([16, 16, 16, 16, 23, 30, 33, 33, 39, 39, 39, 39, 39, 39, 39])

In [67]: Letters
Out[67]: 
array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O'], dtype='<U1')

In [68]: numpy_slice(Order, Letters)
Out[68]: 
{16: array(['A', 'B', 'C', 'D'], dtype='<U1'),
 23: array(['E'], dtype='<U1'),
 30: array(['F'], dtype='<U1'),
 33: array(['G', 'H'], dtype='<U1'),
 39: array(['I', 'J', 'K', 'L', 'M', 'N', 'O'], dtype='<U1')}
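
numpy_slice depends on Order being sorted, which the question guarantees. If that did not hold (an assumption that goes beyond the question), one option would be to apply a stable argsort first so that equal keys become contiguous while keeping their original relative order. A rough sketch, with the hypothetical name numpy_slice_unsorted:

```python
import numpy as np

def numpy_slice_unsorted(order, letters):
    """Like numpy_slice, but sorts first; hypothetical helper for unsorted input."""
    order = np.asarray(order)
    letters = np.asarray(letters)
    # Stable sort keeps the original relative order within each group
    perm = np.argsort(order, kind="stable")
    order, letters = order[perm], letters[perm]
    # Boundaries: start of array, every position where the key changes, end of array
    idx = np.flatnonzero(np.r_[True, order[:-1] != order[1:], True])
    return {int(order[i]): letters[i:j] for i, j in zip(idx[:-1], idx[1:])}

result = numpy_slice_unsorted([5, 0, 5, 0], ["a", "b", "c", "d"])
print(result)
```

The sort costs O(n log n), so for input that is already sorted the original numpy_slice remains the cheaper choice.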
