Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Application hangs on mpi_waitall
From: eblosch_at_[hidden]
Date: 2013-06-25 22:02:26


An update: I recoded the mpi_waitall call as a loop over the requests
using mpi_test, with a 30-second timeout. The timeout occurs
unpredictably, sometimes after 10 minutes of run time and sometimes after
15 minutes, for the exact same case.

Once the 30 seconds have elapsed, I print the status of all outstanding
receive requests. The messages with the outstanding tags have definitely
been sent, so I am wondering why they are not being received.
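
For readers who want to reproduce this kind of diagnostic, here is a
minimal sketch of a test-with-timeout loop and a report of what is still
outstanding. It is written in Python and is not the code from the
application, which is not shown in this thread. It assumes each request
object has a Test() method that returns True once the request has
completed, as mpi4py's Request.Test() does. The function names, the
(source, tag) labels, and the poll interval are illustrative assumptions.

```python
import time


def wait_all_with_timeout(requests, timeout_s=30.0, poll_interval_s=0.001,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll every request until all complete or timeout_s elapses.

    Each request only needs a Test() method returning True once it has
    completed (assumption: same contract as mpi4py's Request.Test()).
    Returns the sorted indices of requests still outstanding; the list
    is empty if everything completed in time.
    """
    pending = set(range(len(requests)))
    deadline = clock() + timeout_s
    while pending:
        # Test each outstanding request once per pass. A completed
        # request is dropped from the set and never tested again.
        for i in list(pending):
            if requests[i].Test():
                pending.discard(i)
        if pending and clock() >= deadline:
            break  # timed out with requests still outstanding
        if pending:
            sleep(poll_interval_s)
    return sorted(pending)


def report_outstanding(pending, labels):
    """Return one line per outstanding request.

    labels[i] is a caller-supplied (source, tag) pair recorded when
    request i was posted (hypothetical bookkeeping, not an MPI feature).
    """
    return ["request %d still outstanding: source=%d tag=%d"
            % (i, *labels[i]) for i in pending]
```

With mpi4py, the requests list would hold the objects returned by
comm.Irecv and comm.Isend, and the lines from report_outstanding could be
printed per rank once the timeout fires.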

As I said before, every process posts non-blocking standard receives,
then non-blocking standard sends, then calls mpi_waitall. Each process is
typically waiting on 200 to 300 requests. Is deadlock possible with this
approach under some unusual conditions?

Thanks again,

Ed

> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> returns. The case runs fine with MVAPICH. The logic associated with the
> communications has been extensively debugged in the past; we don't think
> it has errors. Each process posts non-blocking receives, non-blocking
> sends, and then does waitall on all the outstanding requests.
>
> The work is broken down into 960 chunks. If I run with 960 processes (60
> nodes of 16 cores each), things seem to work. If I use 160 processes
> (each process handling 6 chunks of work), then each process is handling 6
> times as much communication, and that is the case that hangs with OpenMPI
> 1.6.4; again, seems to work with MVAPICH. Is there an obvious place to
> start, diagnostically? We're using the openib btl.
>
> Thanks,
>
> Ed