I have Open MPI running fine for a small to medium number of tasks (a simple
hello or cpi program). But when I try 700 or 800 tasks, it hangs, apparently
on startup. I think this might be related to LDAP, since if I try to log into
my account while the job is hung, I am told my username doesn’t exist.
However, I tried adding --debug to the mpirun command and got the same
sequence of output as for successful smaller runs, until it hung again. So I
added --debug-daemons and got this (with an exit, i.e. no hanging):
…
[blade1:31733] [0,0,0] wrote setup file
--------------------------------------------------------------------------
The rsh launcher has been given a number of 128 concurrent daemons to
launch and is in a debug-daemons option. However, the total number of
daemons to launch (200) is greater than this value. This is a scenario that
will cause the system to deadlock.
To avoid deadlock, either increase the number of concurrent daemons, or
remove the debug-daemons flag.
--------------------------------------------------------------------------
[blade1:31733] [0,0,0] ORTE_ERROR_LOG: Fatal in file ../../../../../orte/mca/rmgr/urm/rmgr_urm.c at line 455
[blade1:31733] mpirun: spawn failed with errno=-6
[blade1:31733] sess_dir_finalize: proc session dir not empty - leaving
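
For what it's worth, my reading of that message is that the launcher refuses to start when debug-daemons is on and there are more daemons to launch than the concurrency limit. Here is a minimal sketch of that condition as I understand it. The function name and arguments are my own, for illustration only, and this is not the actual ORTE code:

```python
def would_deadlock(total_daemons, max_concurrent, debug_daemons):
    """Return True when the condition described in the launcher message holds.

    Hypothetical helper: it only restates the message text, namely that
    debug-daemons is set and the number of daemons to launch exceeds the
    number of concurrent daemons the rsh launcher was given.
    """
    return bool(debug_daemons) and total_daemons > max_concurrent


# Values taken from the message above: 200 daemons, limit of 128.
print(would_deadlock(200, 128, debug_daemons=True))
print(would_deadlock(200, 128, debug_daemons=False))
```

If that reading is right, the abort above is a side effect of --debug-daemons and may not be the same problem as the original hang without it.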
Any ideas or suggestions appreciated.
Todd Heywood