The non-blocking command "Isend" in the mpi4py package cannot overlap computation and communication



Dear experienced programmers, please bear with my post. I would sincerely appreciate your help.

I am using the mpi4py Python package to study the non-blocking communication commands Isend and Irecv. I tried to use these two commands to overlap communication with the subsequent computation, but the code below shows that no overlap between computation and communication occurs when Isend and Irecv are used. This confuses me, and I cannot see the point of non-blocking communication.

In the code, two processes are created, named Rank 0 and Rank 1. Rank 0 sends a matrix Node_r to Rank 1, after which both ranks perform a time-consuming computation that is independent of Node_r. The parameter S controls the size of Node_r, and S_c controls how long the computation takes.

If I use the blocking Send and Recv commands, there should be no overlap between communication and computation, and the total time on each rank should be the sum of the times of the two parts; the results confirm exactly that.

If I use Isend and Irecv, I expect the following: since both commands return immediately, the computation starts right away while the data transfer proceeds at the same time. By the time each rank has finished its computation, the transfer should be almost complete, so the Test() and Wait() calls should not take long.

However, the results show that in the non-blocking case Test() and Wait() take just as long as Send and Recv. In other words, no communication takes place while the two ranks are computing, and vice versa.
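For reference, here is the pattern I would expect to help if the MPI library only makes progress on a pending transfer from inside MPI calls: slice the computation into many short pieces and call Test() between them. This is only a minimal sketch; the array size and slice count are illustrative values, not the ones from my benchmark, and whether any overlap actually happens still depends on the MPI implementation.

from mpi4py import MPI
import numpy as np
from numpy import linalg as LA

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 200  # illustrative size (about 128 MB of float32 data)

if rank == 0:
    data = np.ones((4*n, n, n), 'float32')
    req = comm.Isend([data, MPI.FLOAT], dest=1, tag=11)
else:
    data = np.empty((4*n, n, n), 'float32')
    req = comm.Irecv([data, MPI.FLOAT], source=0, tag=11)

# Slice the expensive computation into many short pieces and poll
# the request between slices, so the MPI library gets frequent
# chances to push the pending transfer forward.
for _ in range(200):
    A = np.random.rand(100, 100)  # one short compute slice
    tt = LA.eig(A)
    req.Test()                    # poke the MPI progress engine

req.Wait()  # make sure the transfer has completed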

I have tried this on a supercomputer (Iridis 5 at the University of Southampton) and on my own workstation, with both a large problem, S=800, S_c=1000 (the transferred matrix Node_r is about 8 GB), and a small one, S=10, S_c=3 (Node_r is 16 KB). Isend and Irecv never speed up the code, which means that no overlap actually occurs. Below I paste the code and the results for S=800, S_c=1000:

from mpi4py import MPI
import numpy as np
import time
import random
from sys import getsizeof
from numpy import linalg as LA
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
S = 800    # controls the size of the matrix Node_r sent to Rank 1
S_c = 1000 # controls the duration of the computation that follows the communication
Maxite = 1 # number of iterations of the task "data transfer -> computation"
Node_r = np.zeros((4*S, S, S), 'float32')
Recv_r = np.zeros((4*S, S, S), 'float32')
if rank == 0:
    Node_r = np.zeros((4*S, S, S), 'float32') + 1.0
    print('Node is of ', getsizeof(Node_r)*1e-3, 'KB')
comm.Barrier()


# Blocking (synchronous) communication:
if rank == 0:
    START_rank0 = time.time()
    for i in range(Maxite):
        s_rank0 = time.time()
        comm.Send([Node_r, MPI.FLOAT], dest=1, tag=11)  # blocking send, returns None
        e1_rank0 = time.time()
        for ii in range(2):
            A = np.random.rand(S_c,S_c)
            tt = LA.eig(A)
        e2_rank0 = time.time()
    END_rank0 = time.time()   
    SendList = [START_rank0,END_rank0,s_rank0,e1_rank0,e2_rank0]     
if rank == 1:
    START_rank1 = time.time()
    for i in range(Maxite): 
        s_rank1 = time.time()
        # Blocking receive, matching the blocking Send on rank 0
        comm.Recv([Recv_r, MPI.FLOAT], source=0, tag=11)
        e1_rank1 = time.time()
        for ii in range(2):
            A = np.random.rand(S_c,S_c)
            tt = LA.eig(A)    
        e2_rank1 = time.time()    
    END_rank1 = time.time()   
    RecvList = [START_rank1,END_rank1,s_rank1,e1_rank1,e2_rank1] 
comm.Barrier()

# Non-blocking (asynchronous) communication:
if rank == 0:
    START2_rank0 = time.time()
    for i in range(Maxite):
        ss_rank0 = time.time()
        req_r = comm.Isend([Node_r, MPI.FLOAT], dest=1, tag=11)
        req_r.Test()
        ee1_rank0 = time.time()
        for ii in range(2):
            A = np.random.rand(S_c,S_c)
            tt = LA.eig(A)
            req_r.Test()
        ee2_rank0 = time.time()
        if not req_r.Test():
            #print('Rank0 Test is False')
            req_r.Wait() 
        ee3_rank0 = time.time() 
    END2_rank0 = time.time()  
    IsendList = [START2_rank0,END2_rank0,ss_rank0,ee1_rank0,ee2_rank0, ee3_rank0]    
if rank == 1 :
    START2_rank1 = time.time()
    for i in range(Maxite): 
        ss_rank1 = time.time()
        recv_r = comm.Irecv([Recv_r, MPI.FLOAT], source=0, tag=11)
        recv_r.Test()
        ee1_rank1 = time.time()
        for ii in range(2):
            A = np.random.rand(S_c,S_c)
            tt = LA.eig(A)
            recv_r.Test()
        ee2_rank1 = time.time() 
        if not recv_r.Test():
            #print('Rank1 Test is False')
            recv_r.Wait() 
        ee3_rank1 = time.time()
    END2_rank1 = time.time()  
    IrecvList =[START2_rank1,END2_rank1,ss_rank1,ee1_rank1,ee2_rank1,ee3_rank1]
comm.Barrier()
if rank==0:
    print('Send: rank0 Duration',SendList[1]-SendList[0])    
    print('Send: rank0 e1-s Send:',SendList[3]-SendList[2])
    print('Send: rank0 e2-e1 Computation:',SendList[4]-SendList[3]) 
    print('Isend rank0 Duration',IsendList[1]-IsendList[0]) 
    print('Isend rank0 ee1-ss Isend:',IsendList[3]-IsendList[2])
    print('Isend rank0 ee2-ee1 Computation:',IsendList[4]-IsendList[3])
    print('Isend rank0 ee3-ee2 Test/Wait:',IsendList[5]-IsendList[4])    

if rank==1:
    time.sleep(3)
    print('#####################################')
    print('Recv: rank1 Duration',RecvList[1]-RecvList[0])    
    print('Recv: rank1 e1-s Recv:',RecvList[3]-RecvList[2])
    print('Recv: rank1 e2-e1 Computation:',RecvList[4]-RecvList[3])     
    print('Irecv rank1 Duration',IrecvList[1]-IrecvList[0]) 
    print('Irecv rank1 ee1-ss Irecv:',IrecvList[3]-IrecvList[2])
    print('Irecv rank1 ee2-ee1 Computation:',IrecvList[4]-IrecvList[3])
    print('Irecv rank1 ee3-ee2 Test/Wait:',IrecvList[5]-IrecvList[4])    

The results are:

('Node is of ', 8192000.1280000005, 'KB')

('Send: rank0 Duration', 17.594741106033325)
('Send: rank0 e1-s Send:', 5.696147918701172)
('Send: rank0 e2-e1 Computation:', 11.898576974868774)
('Isend rank0 Duration', 17.708693027496338)
('Isend rank0 ee1-ss Isend:', 0.00017595291137695312)
('Isend rank0 ee2-ee1 Computation:', 11.87893009185791)
('Isend rank0 ee3-ee2 Test/Wait:', 5.8295629024505615)

('Recv: rank1 Duration', 17.276994943618774)
('Recv: rank1 e1-s Recv:', 5.695960998535156)
('Recv: rank1 e2-e1 Computation:', 11.581011056900024)
('Irecv rank1 Duration', 17.708734035491943)
('Irecv rank1 ee1-ss Irecv:', 3.814697265625e-05)
('Irecv rank1 ee2-ee1 Computation:', 17.708672046661377)
('Irecv rank1 ee3-ee2 Test/Wait:', 1.6927719116210938e-05)

As can be seen, no matter whether Test() or Wait() is executed, the communication always takes about 5.5 s and the computation about 11.5 s, so the total time is always about 17 s.

Is there any way to reduce the total time in the Isend/Irecv case so that it is less than 17 seconds?
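One workaround I have read about, sketched below, is to let a helper thread block in Wait() while the main thread computes: mpi4py requests MPI_THREAD_MULTIPLE by default and releases the GIL around MPI calls, so in principle the library can progress the transfer in the background. I am not sure this helps on every MPI library, and the sizes here are again illustrative.

from mpi4py import MPI
import numpy as np
import threading
from numpy import linalg as LA

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n = 200  # illustrative size

if rank == 0:
    data = np.ones((4*n, n, n), 'float32')
    req = comm.Isend([data, MPI.FLOAT], dest=1, tag=11)
else:
    data = np.empty((4*n, n, n), 'float32')
    req = comm.Irecv([data, MPI.FLOAT], source=0, tag=11)

# The helper thread blocks inside Wait(); since mpi4py releases the
# GIL around MPI calls, the library can work on the transfer while
# the main thread runs the computation below.
t = threading.Thread(target=req.Wait)
t.start()

A = np.random.rand(1000, 1000)
tt = LA.eig(A)  # main-thread computation, hopefully overlapped

t.join()  # from here on the transfer is guaranteed complete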

I have considered the possibility that Node_r is so large that Isend and Irecv have to proceed in a synchronous (rendezvous) fashion, as described in the paper Understanding the Behavior and Performance of Non-blocking Communications in MPI, but I found that the overlap is negligible even in the very small case.
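Related to the rendezvous issue, I understand that some MPI implementations offer an asynchronous-progress mode that moves the transfer to an internal library thread. The variable names below are implementation-specific assumptions (for example, MPICH must also be built with async-progress support), so they would need to be checked against the local library's documentation:

import os

# Must be set before MPI is initialized, i.e. before the line
# "from mpi4py import MPI" runs (or be exported via mpiexec/mpirun).
os.environ['MPICH_ASYNC_PROGRESS'] = '1'    # MPICH
# os.environ['I_MPI_ASYNC_PROGRESS'] = '1'  # Intel MPI equivalent

from mpi4py import MPI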

I would appreciate suggestions of any kind.

