Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Application hangs on mpi_waitall
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-06-27 11:38:24


If I understand correctly, the communication pattern is a one-to-all type of communication (from your server to your clients), isn't it? In that case this might be a credit-management issue, where the master runs out of ACK buffers and the clients can't acknowledge the retrieval of the data.

Let's try adding "--mca btl_openib_flags 9" to the mpirun command (this disables RMA communication and forces everything to use pure send/recv semantics).
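
For example (the executable name and process count here are just placeholders):

  mpirun --mca btl_openib_flags 9 -np 160 ./your_app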

  George.

On Jun 27, 2013, at 15:01, Ed Blosch <eblosch_at_[hidden]> wrote:

> It ran a bit longer but still deadlocked. All matching sends are posted 1:1 with posted recvs, so it is a delivery issue of some kind. I'm running a debug-compiled version tonight to see what that might turn up. I may try rewriting with blocking sends to see if that works. I can also try adding a barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffered while waiting for recvs to be posted; the ordering is sketched below.
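>
> Schematically, in C (the neighbor list, buffers, and counts below are placeholders, not the real code):
>
>   #include <mpi.h>
>
>   /* post all receives, barrier, then all sends, then wait on everything;
>      the barrier guarantees every rank has posted its recvs before any
>      send starts */
>   void exchange(int n, const int *nbr, const int *cnt,
>                 double **rbuf, double **sbuf, MPI_Comm comm)
>   {
>       MPI_Request req[2 * n];           /* C99 VLA; 2*n is ~400-600 here */
>       for (int i = 0; i < n; i++)
>           MPI_Irecv(rbuf[i], cnt[i], MPI_DOUBLE, nbr[i], 0, comm, &req[i]);
>       MPI_Barrier(comm);                /* all recvs posted everywhere */
>       for (int i = 0; i < n; i++)
>           MPI_Isend(sbuf[i], cnt[i], MPI_DOUBLE, nbr[i], 0, comm, &req[n + i]);
>       MPI_Waitall(2 * n, req, MPI_STATUSES_IGNORE);
>   }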
>
> -------- Original message --------
> From: George Bosilca <bosilca_at_[hidden]>
> Date:
> To: Open MPI Users <users_at_[hidden]>
> Subject: Re: [OMPI users] Application hangs on mpi_waitall
>
>
> Ed,
>
> I'm not sure, but there might be a case where the BTL is getting overwhelmed by the non-blocking operations while trying to set up the connections. There is a simple test for this: add an MPI_Alltoall with a reasonable size (100k) before you start posting the non-blocking receives, and let's see if that solves your issue.
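>
> Something along these lines (a minimal sketch; the 100k-per-peer size and MPI_BYTE type are just illustrative):
>
>   #include <mpi.h>
>   #include <stdlib.h>
>
>   /* one all-to-all to force connection setup before the real exchange */
>   void warmup(MPI_Comm comm)
>   {
>       int np;
>       MPI_Comm_size(comm, &np);
>       int n = 100000;                   /* ~100 KB per peer */
>       char *sb = calloc((size_t)np * n, 1);
>       char *rb = malloc((size_t)np * n);
>       MPI_Alltoall(sb, n, MPI_BYTE, rb, n, MPI_BYTE, comm);
>       free(sb);
>       free(rb);
>   }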
>
> George.
>
>
> On Jun 26, 2013, at 04:02, eblosch_at_[hidden] wrote:
>
> > An update: I recoded the mpi_waitall as a loop over the requests with
> > mpi_test and a 30-second timeout. The timeout happens unpredictably,
> > sometimes after 10 minutes of run time, other times after 15 minutes, for
> > the exact same case.
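> >
> > Roughly like this in C (the request array, count, and message printed are schematic; the real code also reports the message tag):
> >
> >   #include <mpi.h>
> >   #include <stdio.h>
> >
> >   /* returns the number of requests still pending after a 30 s timeout,
> >      or 0 if everything completed */
> >   int waitall_with_timeout(int nreq, MPI_Request *req)
> >   {
> >       double t0 = MPI_Wtime();
> >       for (;;) {
> >           int pending = 0;
> >           for (int i = 0; i < nreq; i++) {
> >               if (req[i] == MPI_REQUEST_NULL) continue;
> >               int flag;
> >               MPI_Test(&req[i], &flag, MPI_STATUS_IGNORE);
> >               if (!flag) pending++;   /* completed ones become MPI_REQUEST_NULL */
> >           }
> >           if (pending == 0) return 0;
> >           if (MPI_Wtime() - t0 > 30.0) {
> >               for (int i = 0; i < nreq; i++)
> >                   if (req[i] != MPI_REQUEST_NULL)
> >                       fprintf(stderr, "request %d still pending\n", i);
> >               return pending;
> >           }
> >       }
> >   }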
> >
> > After 30 seconds, I print out the status of all outstanding receive
> > requests. The message tags that are outstanding have definitely been
> > sent, so I am wondering why they are not being received.
> >
> > As I said before, everybody posts non-blocking standard receives, then
> > non-blocking standard sends, then calls mpi_waitall. Each process is
> > typically waiting on 200 to 300 requests. Is deadlock possible with this
> > implementation approach under some unusual set of conditions?
> >
> > Thanks again,
> >
> > Ed
> >
> >> I'm running Open MPI 1.6.4 and seeing a problem where mpi_waitall never
> >> returns. The case runs fine with MVAPICH. The logic associated with the
> >> communications has been extensively debugged in the past; we don't think
> >> it has errors. Each process posts non-blocking receives, non-blocking
> >> sends, and then does waitall on all the outstanding requests.
> >>
> >> The work is broken down into 960 chunks. If I run with 960 processes (60
> >> nodes of 16 cores each), things seem to work. If I use 160 processes
> >> (each process handling 6 chunks of work), then each process is handling 6
> >> times as much communication, and that is the case that hangs with Open MPI
> >> 1.6.4; again, it seems to work with MVAPICH. Is there an obvious place to
> >> start, diagnostically? We're using the openib btl.
> >>
> >> Thanks,
> >>
> >> Ed