Open MPI User's Mailing List Archives


Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
From: Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 (jonathan.stergiou_at_[hidden])
Date: 2011-04-11 09:53:11


I am running Open MPI 1.4.2 under RHEL 5.5. After installation, I tested with "mpirun -np 4 date"; the command returned four "date" outputs.

Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".
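
For reference, a bare-bones test program along the lines of the sketch below (hypothetical code, not part of geminimpi or salinas) would separate "mpirun can launch processes" from "MPI startup and communication work with more than one rank", since "date" never calls MPI_Init:

[code]
/* hello_mpi.c - minimal check: if this also hangs with -np 2,
 * the problem is in MPI startup rather than in the applications. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1, size = -1;

    MPI_Init(&argc, &argv);               /* a broken startup would hang here */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Barrier(MPI_COMM_WORLD);          /* exercises actual communication */
    MPI_Finalize();
    return 0;
}
[/code]

Built with "mpicc hello_mpi.c -o hello_mpi" and run with "mpirun -np 2 hello_mpi".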

Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following output, which looks good and is what I would expect:

[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
 Checking license for Gemini
 Checking license for Linux OS
 Checking internal license list
 License valid
 
 GEMINI Startup
 Gemini +++ Version 5.1.00 20110501 +++
 
 +++++ ERROR MESSAGE +++++
 FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]

With "mpirun --debug-daemons -np 2 geminimpi", it hangs indefinitely at this point:

[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]

I cloned my entire installation to a number of other machines to test. On all the other workstations, everything behaves correctly and various regression suites return good results.

Any ideas?

--
Jon Stergiou
Engineer
NSWC Carderock