Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] custom sparse collective non-reproducible deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv question
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2011-09-19 10:59:36


On 9/18/2011 9:12 AM, Evghenii Gaburov wrote:
> Hi All,
>
> Update to the original posting: METHOD4 also resulted in a deadlock on system HPC2 after a 5h run with 32 MPI tasks; also, "const int scale=1;" was missing in the code snippet posted above.
>
> --Evghenii
>
>> Message: 2
>> Date: Sun, 18 Sep 2011 02:06:33 +0000
>> From: Evghenii Gaburov <e-gaburov_at_[hidden]>
>> Subject: [OMPI users] custom sparse collective non-reproducible
>> deadlock, MPI_Sendrecv, MPI_Isend/MPI_Irecv or MPI_Send/MPI_Recv
>> question
>> To: "users_at_[hidden]"<users_at_[hidden]>
>> Message-ID:<8509050A-7357-408E-8D58-C5AEFA7B36CC_at_[hidden]>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> Hi All,
>>
>> My MPI program's basic task is to regularly establish point-to-point communication with other procs via MPI_Alltoall and then communicate the data. I tested it on two HPC clusters with 32-256 MPI tasks. On one of the systems (HPC1) this custom collective runs flawlessly, while on the other (HPC2) it causes non-reproducible deadlocks (after a day of running, or after a few hours or so). So I want to figure out whether it is a system (HPC2) bug that I can report to the HPC admins, or a subtle bug in my code that needs to be fixed. One possibly important point: I communicate a huge amount of data between tasks (up to ~2GB of data) in several all2all calls.
>>
>> I would like expert eyes to look at the code and confirm or disprove that it is deadlock-safe. I have implemented several methods (METHOD1 - METHOD4) that, if I am not mistaken, should in principle be deadlock-safe. However, as a beginner MPI user, I can easily miss something subtle, so I am seeking your help with this! I mostly used METHOD4, which has caused periodic deadlocks, after having had deadlocks with METHOD1 and METHOD2. On HPC1 none of these methods deadlocks in my runs. METHOD3 I am currently testing, so I cannot comment on it yet, but I will later; in the meantime, I will be happy to hear your comments.
>>
>> Both systems use openmpi-1.4.3.
>>
>> Your answers will be of great help! Thanks!
>>
>> Cheers,
>> Evghenii
>>
>> Here is the code snippet:
>>
>> template<class T>
>> void all2all(std::vector<T> sbuf[], std::vector<T> rbuf[],
>>              const int myid,
>>              const int nproc)
>> {
>>   static int nsend[NMAXPROC], nrecv[NMAXPROC];
>>   for (int p = 0; p < nproc; p++)
>>     nsend[p] = sbuf[p].size();
>>   MPI_Alltoall(nsend, 1, MPI_INT, nrecv, 1, MPI_INT, MPI_COMM_WORLD);  // let the other tasks know how much data they will receive from this one
>>
>> #ifdef _METHOD1_
>>
>>   static MPI_Status  stat[NMAXPROC  ];
>>   static MPI_Request req [NMAXPROC*2];
>>   int nreq = 0;
>>   for (int p = 0; p < nproc; p++)
>>     if (p != myid)
>>     {
>>       const int scount = nsend[p];
>>       const int rcount = nrecv[p];
>>       rbuf[p].resize(rcount);
>>       if (scount > 0) MPI_Isend(&sbuf[p][0], nscount, datatype<T>(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
>>       if (rcount > 0) MPI_Irecv(&rbuf[p][0],  rcount, datatype<T>(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
>>     }
>>   rbuf[myid] = sbuf[myid];
>>   MPI_Waitall(nreq, req, stat);
Incidentally, here you have 2*nproc requests, interlacing sends and receives. Your array of statuses, however, is only NMAXPROC big. I think you need to declare stat[NMAXPROC*2]. Also, do you want scount in place of nscount?
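
For what it's worth, a minimal sketch of that METHOD1 loop with those two changes applied (status array sized for 2*NMAXPROC requests, and scount passed to MPI_Isend) might look like the following. It reuses the NMAXPROC constant and the datatype<T>() helper from your snippet, so take it as an illustration rather than a drop-in replacement:

  static MPI_Status  stat[NMAXPROC*2];   // one status slot per request, so 2x
  static MPI_Request req [NMAXPROC*2];
  int nreq = 0;
  for (int p = 0; p < nproc; p++)
    if (p != myid)
    {
      const int scount = nsend[p];
      const int rcount = nrecv[p];
      rbuf[p].resize(rcount);
      // the send count must match the send buffer: scount, not nscount
      if (scount > 0) MPI_Isend(&sbuf[p][0], scount, datatype<T>(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
      if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount, datatype<T>(), p, 1, MPI_COMM_WORLD, &req[nreq++]);
    }
  rbuf[myid] = sbuf[myid];
  MPI_Waitall(nreq, req, stat);

If you never inspect the statuses, you could also just pass MPI_STATUSES_IGNORE as the last argument to MPI_Waitall and sidestep the sizing question entirely.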