I was trying to understand how the debugger interface is supposed to work. And if I was confused before, that feeling never disappeared.
There is one thing that I really can't figure out, and I hope that somebody (Jeff/Ralph/Rolf based on svn blame) can enlighten me.
It concerns MPIR_debug_gate. In the document accepted by the MPI Forum we have the following definition:
> MPIR_debug_gate is an integer variable that is set to 1 by the tool to notify the MPI
> processes that the debugger has attached. An MPI process may use this variable as a
> synchronization mechanism to prevent it from running away before the tool has time to
> attach to the process.
> An MPI implementation is not required to use the MPIR_debug_gate variable for synchronization. However, the MPI job control runtime system must prevent the created MPI
> processes from running beyond the return from the application's call to MPI_INIT.
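To make the gate concrete, here is a minimal sketch in C of how an MPI process could hold itself until the tool sets the variable. Only the name MPIR_debug_gate comes from the document; the helper wait_for_debugger_gate and its polling interval are hypothetical, and this is not claimed to be what any implementation actually does.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Written by the tool (set to 1) once it has attached. Declared
 * volatile so the loop below re-reads it on every iteration. */
volatile int MPIR_debug_gate = 0;

/* Hypothetical helper: poll the gate every poll_ns nanoseconds
 * (must be below 1e9) until the tool opens it. Returns how many
 * times the process had to sleep before the gate was open. */
static int wait_for_debugger_gate(long poll_ns)
{
    struct timespec delay = { 0, poll_ns };
    int polls = 0;

    while (MPIR_debug_gate == 0) {
        nanosleep(&delay, NULL);
        ++polls;
    }
    return polls;
}
```

A process would call something like wait_for_debugger_gate(1000000L) before returning from MPI_INIT; if nothing ever sets the gate, the loop never exits, which is exactly why it matters who is responsible for setting it.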
In case it is not clear enough, in the section describing the startup process, we can find the following clarification:
> If the symbol MPIR_partial_attach_ok is defined in the starter process, then this
> informs the tool that the initial startup barrier is implemented by the MPI system,
> and it is not necessary to set the MPIR_debug_gate variable in any of MPI processes.
> However, if the symbol MPIR_partial_attach_ok is not defined in the starter process,
> the tool must attach and set the MPIR_debug_gate variable to 1 in each MPI processes
> to release them from the gate, even if the tool user has instructed the tool to not attach
> to all of the MPI processes.
The starter process is, in our case, mpirun. Open MPI defines MPIR_partial_attach_ok, so the tool will assume that we provide a means of synchronizing the processes that is not based on MPIR_debug_gate. Therefore only one behavior is acceptable based on the text above: the tool should not set MPIR_debug_gate to 1.
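My reading of the tool-side rule in the quoted text can be sketched in C as follows. The enum and the function tool_gate_action are hypothetical names used only for illustration; the rule itself is my interpretation of the quoted paragraph, not code taken from any debugger.

```c
#include <stdbool.h>

/* Hypothetical names for the two things a tool can do with the gate. */
enum gate_action {
    GATE_LEAVE_ALONE,          /* starter implements the startup barrier */
    GATE_SET_IN_ALL_PROCESSES  /* tool must write 1 in every MPI process */
};

/* Decide what the tool does, given whether MPIR_partial_attach_ok was
 * found in the starter process. The user's choice to attach to only a
 * subset of processes does not change the outcome. */
static enum gate_action tool_gate_action(bool partial_attach_ok_defined,
                                         bool user_attaches_to_subset)
{
    (void)user_attaches_to_subset;  /* deliberately ignored, per the text */

    return partial_attach_ok_defined ? GATE_LEAVE_ALONE
                                     : GATE_SET_IN_ALL_PROCESSES;
}
```

Under this reading, a starter that defines MPIR_partial_attach_ok always lands in the first branch, so its processes cannot count on the gate ever being set.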
However, in ompi_debuggers.c, around line 226, we have an if that switches between the two behaviors (MPIR_debug_gate or our own mechanism) depending on whether or not we are standalone (slurmd or generic). As generic is the ess loaded in most cases, I can't figure out how this works if the MPIR specification document is to be trusted.