I have Open MPI running fine for a small or medium number of tasks (a simple hello or cpi program). But when I try 700 or 800 tasks, it hangs, apparently on startup. I think this might be related to LDAP, since if I try to log into my account while the job is hung, I am told my username doesn't exist. However, I tried adding --debug to the mpirun command line and got the same sequence of output as for successful smaller runs, until it hung again. So I added --debug-daemons and got this (mpirun exited rather than hanging):

[blade1:31733] [0,0,0] wrote setup file
--------------------------------------------------------------------------
The rsh launcher has been given a number of 128 concurrent daemons to
launch and is in a debug-daemons option. However, the total number of
daemons to launch (200) is greater than this value. This is a scenario that
will cause the system to deadlock.

To avoid deadlock, either increase the number of concurrent daemons, or
remove the debug-daemons flag.
--------------------------------------------------------------------------
[blade1:31733] [0,0,0] ORTE_ERROR_LOG: Fatal in file ../../../../../orte/mca/rmgr/urm/rmgr_urm.c at line 455
[blade1:31733] mpirun: spawn failed with errno=-6
[blade1:31733] sess_dir_finalize: proc session dir not empty - leaving

 

Any ideas or suggestions appreciated.

 

Todd Heywood