I'm trying to figure out what the limit is on the number of
pending nonblocking operations, as it does not seem to be specified
anywhere. I apologize if this is better suited to the user list, but
this seemed like information more likely to be available on the dev list.
As part of a toy assignment involving multiplying triangular square
matrices, one solution being compared sends each row and column
individually. On matrices of 100 and 1000 rows the program works
fine. With 5000 rows, however, it works with 8 processes spread across
4 or 2 nodes but not on a single node. Similarly, with 4 processes it
works on 2 nodes but not on one, and with 2 processes on 1 node it
fails. The failure appears to occur because some number of receives
(at least 2500) never complete, so an MPI_Waitany never returns. No
errors are returned by the MPI_Isend, MPI_Irecv, or MPI_Waitany calls.
As it works on multiple nodes, but not one node, it seems reasonable to
believe that the problem lies with there being too many nonblocking
operations in progress, as there are a total of around 18000 pending
operations at once if all the processes are run on one node.
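One way to test this hypothesis would be to cap the number of operations in flight and see whether the hang goes away. Below is a minimal sketch of that idea in plain C; it is not my actual code. The names run_windowed, post_fn, drain_fn, and tracker are hypothetical. In a real program the post callback would wrap MPI_Isend or MPI_Irecv and the drain callback would wrap MPI_Waitall over the requests posted since the last drain. It also assumes both sides post matching operations in the same order, since otherwise waiting on a partial window could deadlock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks. In an MPI program, post would start one
 * nonblocking operation (MPI_Isend or MPI_Irecv) for message `index`,
 * and drain would complete the `count` most recently posted ones
 * (for example with MPI_Waitall). */
typedef void (*post_fn)(size_t index, void *ctx);
typedef void (*drain_fn)(size_t count, void *ctx);

/* Post `total` operations while never leaving more than `window`
 * of them pending. Returns the number of drain calls made. */
size_t run_windowed(size_t total, size_t window,
                    post_fn post, drain_fn drain, void *ctx)
{
    size_t pending = 0, drains = 0;

    assert(post != NULL && drain != NULL);
    if (window == 0)
        return 0; /* a zero-sized window cannot post anything */

    for (size_t i = 0; i < total; ++i) {
        post(i, ctx);
        if (++pending == window) {
            /* Window is full: complete everything before posting more. */
            drain(pending, ctx);
            pending = 0;
            ++drains;
        }
    }
    if (pending > 0) {
        /* Complete the final, partially filled window. */
        drain(pending, ctx);
        ++drains;
    }
    return drains;
}

/* Stand-in for real communication: only counts what is in flight. */
struct tracker {
    size_t pending;     /* operations posted but not yet completed */
    size_t max_pending; /* high-water mark of `pending` */
    size_t completed;   /* operations completed so far */
};

static void count_post(size_t index, void *ctx)
{
    struct tracker *t = ctx;
    (void)index;
    t->pending++;
    if (t->pending > t->max_pending)
        t->max_pending = t->pending;
}

static void count_drain(size_t count, void *ctx)
{
    struct tracker *t = ctx;
    t->pending -= count;
    t->completed += count;
}
```

Calling run_windowed(18000, 512, count_post, count_drain, &t) with a zeroed tracker t keeps the high-water mark at the window size instead of 18000. If the real program completes with a small window but hangs with a large one, that would point at resource exhaustion rather than a matching bug.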
The MPI standard says the following, but I can't find a definition of
what Open MPI considers pathological; pointers to where this is
documented would be appreciated. I've attached the output of
ompi_info --all in case it is of any use.
"If the call causes some system resource to be exhausted, then it will
fail and return an error code. Quality implementations of MPI should
ensure that this happens only in ``pathological'' cases. That is, an MPI
implementation should be able to support a large number of pending