I'm trying to diagnose an MPI job (in this case xhpl), that fails to
start when the rank count gets fairly high into the thousands.
My symptom is the jobs fires up via slurm, and I can see all the xhpl
processes on the nodes, but it never kicks over to the next process.
My question is, what debugs should I turn on to tell me what the
system might be waiting on?
I've checked a bunch of things, but I'm probably overlooking something
trivial (which is par for me).
I'm using the Openmpi 1.6.1, Slurm 2.4.2 on CentOS 6.3, with Infiniband/PSM