An update: I replaced the mpi_waitall call with a loop that polls the
requests with mpi_test and gives up after a 30-second timeout. The timeout
fires unpredictably: sometimes after 10 minutes of run time, other times
after 15 minutes, for the exact same case.
Once the 30 seconds have elapsed, I print the status of all outstanding
receive requests. The messages with the outstanding tags have definitely
been sent, so why are they not being received?
As I said before, every process posts non-blocking standard receives, then
non-blocking standard sends, then calls mpi_waitall. Each process is
typically waiting on 200 to 300 requests. Can this approach deadlock under
some unusual conditions?
> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
> returns. The case runs fine with MVAPICH. The logic associated with the
> communications has been extensively debugged in the past; we don't think
> it has errors. Each process posts non-blocking receives, non-blocking
> sends, and then does waitall on all the outstanding requests.
> The work is broken down into 960 chunks. If I run with 960 processes (60
> nodes of 16 cores each), things seem to work. If I use 160 processes
> (each process handling 6 chunks of work), then each process is handling 6
> times as much communication, and that is the case that hangs with OpenMPI
> 1.6.4; again, seems to work with MVAPICH. Is there an obvious place to
> start, diagnostically? We're using the openib btl.