Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Recv hangs
From: Jorge Chiva Segura (jordic_at_[hidden])
Date: 2012-05-04 07:07:32


Why? Removing the barrier would let all the other processors advance, but
the processor that is waiting to receive would still wait forever.
Moreover, the real code has no barrier. There I use Isends, Irecvs and
Waits, so I don't think the barrier is the problem.

I have tried adding "-mca btl_openib_flags 305" and it worked ^^. Now I
am trying to understand why, and what the impact on performance is.

Thank you anyway for your suggestion,
Jorge

On Fri, 2012-05-04 at 07:00 -0400, Jeff Squyres wrote:
> Try removing the barrier.
>
> On May 4, 2012, at 5:52 AM, Jorge Chiva Segura wrote:
>
> > Hi all,
> >
> > I have a program that executes a communication loop similar to this one:
> >
> > 1: for(int p1=0; p1<np; ++p1) {
> > 2:   for(int p2=0; p2<np; ++p2) {
> > 3:     if(me==p1) {            // this rank sends first, then receives
> > 4:       if(sendSize(p2)) MPI_Ssend(sendBuffer[p2],sendSize(p2),MPI_FLOAT,p2,0,myw);
> > 5:       if(recvSize(p2)) MPI_Recv(recvBuffer[p2],recvSize(p2),MPI_FLOAT,p2,0,myw,&status);
> > 6:     } else if(me==p2) {     // the partner receives first, then sends
> > 7:       if(recvSize(p1)) MPI_Recv(recvBuffer[p1],recvSize(p1),MPI_FLOAT,p1,0,myw,&status);
> > 8:       if(sendSize(p1)) MPI_Ssend(sendBuffer[p1],sendSize(p1),MPI_FLOAT,p1,0,myw);
> > 9:     }
> > 10:    MPI_Barrier(myw);
> > 11:   }
> > 12: }
> >
> > The program is an iterative process that makes some calculations, communicates and then continues with the next iteration. The problem is that after making 30 successful iterations the program hangs. With padb I have seen that one of the processors waits at line 5 for the reception of data that was already sent and the rest of the processors are waiting at the barrier in line 10. The size of the messages and buffers is the same for all the iterations.
> >
> > My real program uses asynchronous communications for obvious performance reasons, and it worked without problems when the case to solve was smaller (fewer processors and less memory). For this case, however, the program hung, which is why I changed the communication routine to use synchronous communications, to see where the problem is. Now I know where the program hangs, but I don't understand what I am doing wrong.
> >
> > Any suggestions?
> >
> > More specific data of the case and cluster:
> > Number of processors: 320
> > Max size of the message: 6800 floats (27200 bytes)
> > Number of cores per node: 32
> > File system: lustre
> > Resource manager: slurm
> > OMPI version: 1.4.4
> > Operating system: Ubuntu 10.04.4 LTS
> > Kernel: RHEL 6.2 2.6.32-220.4.2
> > InfiniBand software: OFED 1.4.2
> > InfiniBand adapter: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
> >
> > Thank you for your time,
> > Jorge
> > --
> > This message has been scanned by MailScanner
> > for viruses and other dangerous content,
> > and is considered clean.
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
