I am running Open MPI 1.4.2 under RHEL 5.5. After installing, I tested with "mpirun -np 4 date", which returned four "date" outputs as expected.
Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".
Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following, which looks good and is what I would expect:
[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
Checking license for Gemini
Checking license for Linux OS
Checking internal license list
License valid
GEMINI Startup
Gemini +++ Version 5.1.00 20110501 +++
+++++ ERROR MESSAGE +++++
FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]
With "mpirun --debug-daemons -np 2 geminimpi", it prints the following and then hangs indefinitely:
[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]
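For anyone who wants to reproduce the comparison between "-np 1" and larger process counts without watching the terminal, here is a rough sketch of how the sweep could be scripted. It is not something I ran as part of the tests above. The helper names (run_with_timeout, sweep), the 60-second limit, and the list of process counts are illustrative assumptions, and it needs Python 3.5 or later, which is newer than the stock Python on RHEL 5.5.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hang'.

    'hang' only means the command did not finish within timeout_s seconds;
    the limit is an arbitrary choice, not a property of mpirun.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child it started before raising this.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


def sweep(prog, np_values, timeout_s=60, launcher="mpirun"):
    """Run 'launcher -np N prog' for each N and collect the outcomes."""
    return {
        n: run_with_timeout([launcher, "-np", str(n), prog], timeout_s)
        for n in np_values
    }


if __name__ == "__main__":
    # "geminimpi" is the program name from this post; adjust as needed.
    for n, status in sweep("geminimpi", [1, 2, 4]).items():
        print("-np %d: %s" % (n, status))
```

Note that a run which exits with an error (such as the missing gemini.inp case above) is reported as "failed" rather than "hang", so the two situations stay distinguishable. Only the launcher process itself is killed on timeout, so stray application processes may need to be cleaned up by hand afterwards.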
I cloned my entire installation to a number of other machines to test. On all the other workstations, everything behaves correctly and various regression suites return good results.
Any ideas?
--
Jon Stergiou
Engineer
NSWC Carderock