On 8 Nov 2011, at 00:59, George Bosilca wrote:
> A started process is defined as being our mpirun. In Open MPI MPIR_partial_attach_ok is defined, so the tool will suppose that we provide a means to synchronize the processes not based on MPIR_debug_gate. Therefore only one behavior is acceptable based on the text above: no MPIR_debug_gate=1 should be issued by the tool.
Open MPI itself (via ORTE) is not the only possible launch mechanism for Open MPI jobs. Slurm is the only other tool I can think of off the top of my head that can do it, but I wouldn't be surprised if there are others. At the time the document was written, the MPI library and the resource manager/job launcher were assumed to be so closely integrated that they could be treated as part of the same software.
> However, in ompi_debuggers.c around line 226, we have an if that switches between the two acceptable behaviors (MPIR_debug_gate or own mechanism) based on whether or not we are standalone (slurmd or generic). As generic is the ess loaded in most cases, I can't figure out how this works if the MPIR specification document is to be trusted.
Unless the library can guarantee that the starter process has MPIR_partial_attach_ok, the only safe thing it can do is wait on MPIR_debug_gate. The only way the library can make any guarantees about mpirun is if it was launched from orted.
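To make that rule concrete, here is a minimal C sketch of the library-side decision. It is not the code from ompi_debuggers.c. MPIR_being_debugged and MPIR_debug_gate are the MPIR variable names, while must_wait_on_gate(), wait_on_debug_gate(), the launched_by_orted flag and the bounded poll count are hypothetical and exist only for illustration. I'm also assuming the process only needs to wait at all when MPIR_being_debugged is set. A real implementation would block until the gate opens and would sleep between polls.

```c
#include <assert.h>
#include <stdbool.h>

/* MPIR variables in the MPI process; a debugger writes to them from outside. */
volatile int MPIR_being_debugged = 0;
volatile int MPIR_debug_gate = 0;

/*
 * Hypothetical helper: the library can only skip the gate when it knows the
 * starter process is an mpirun that defines MPIR_partial_attach_ok, which
 * (per the argument above) it can only know if it was launched from orted.
 */
bool must_wait_on_gate(bool launched_by_orted)
{
    return MPIR_being_debugged != 0 && !launched_by_orted;
}

/*
 * Hypothetical helper: poll the gate at most max_polls times.
 * Returns true if the debugger has set MPIR_debug_gate to a non-zero value.
 * The bound exists only so this sketch terminates; real code would loop
 * until the gate opens.
 */
bool wait_on_debug_gate(int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        if (MPIR_debug_gate != 0) {
            return true;
        }
    }
    return false;
}
```

An init path would call wait_on_debug_gate() only when must_wait_on_gate() returns true. Under a launcher such as Slurm the flag would be false, so the process falls back to waiting on the gate.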
I agree that this isn't clear. I don't think this spec is well understood by anyone; indeed, it wasn't originally written with the intention of becoming a specification at all. I've looked at it a couple of times but never used this aspect of it. padb (and I believe STAT is the same) never launches jobs under the control of the debugger; it simply attaches to an already existing job, which means I've been able to ignore this part of the spec in padb entirely.