Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
From: Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 (jonathan.stergiou_at_[hidden])
Date: 2011-04-12 10:43:58


Apologies for not clarifying. The behavior below is expected; I am just checking that Gemini starts up and looks for its input file. When Gemini+OpenMPI is working correctly, that is exactly what I expect to see.

When Gemini+OpenMPI is not working correctly (the current behavior), I see the second case instead. With "-np 1", Gemini starts up and looks for its input file. With "-np 2" (or anything more than 1), Gemini never starts; the run simply hangs indefinitely. I used Gemini as an example, but I don't believe the issue is Gemini-specific, since I've reproduced the same hanging behavior with two other MPI codes (Salinas and ParaDyn).

The same codebase runs correctly on many other workstations (it was transferred from my machine, the build machine, to each colleague's machine via "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra").

I tried the following fixes, but the problem persists:

- Copy salinas (or geminimpi) locally and run "mpirun -np 2 ./salinas".
Tried this both interactively and through the queueing system; no difference in behavior.

- Compare "ldd salinas" and "ldd geminimpi" output with functioning examples from coworkers' workstations (roughly as in the first sketch after this list).
Compared the output against the other workstations; the comparisons look fine.

- Create a new user account with a clean profile on my workstation, in case it is an environment problem.
Created the account and sourced "/opt/sierra/install/sierra_init.sh" to set up the path; no difference in behavior.

- Compare /etc/profile and /etc/bashrc with "functioning" examples (also sketched below).
Compared mine with colleagues'; the comparisons don't raise any flags.
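
For reference, the library comparison was along these lines. This is only a sketch: the binary path is illustrative, and "targetmachine" stands in for one of the working workstations.

[code]
# On the build machine: list resolved shared libraries, dropping the load addresses
ldd /opt/sierra/bin/salinas | sed 's/ (0x[0-9a-f]*)//' | sort > /tmp/ldd_salinas_local.txt

# Same command on a known-good workstation, collected over ssh
ssh targetmachine "ldd /opt/sierra/bin/salinas | sed 's/ (0x[0-9a-f]*)//' | sort" > /tmp/ldd_salinas_remote.txt

# Differences in resolved library paths or versions show up here
diff /tmp/ldd_salinas_local.txt /tmp/ldd_salinas_remote.txt
[/code]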
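
The environment comparison was similar; again just a sketch, assuming the same init script and hostname as above.

[code]
# Environment after sourcing the standard init script (run as the clean test account)
source /opt/sierra/install/sierra_init.sh
env | sort > /tmp/env_local.txt

# Same capture on a working workstation, then compare
# (expect benign differences like HOSTNAME; the interesting ones are PATH and LD_LIBRARY_PATH)
ssh targetmachine 'source /opt/sierra/install/sierra_init.sh; env | sort' > /tmp/env_remote.txt
diff /tmp/env_local.txt /tmp/env_remote.txt

# System-wide shell setup, compared against the same workstation
diff /etc/profile <(ssh targetmachine cat /etc/profile)
diff /etc/bashrc  <(ssh targetmachine cat /etc/bashrc)
[/code]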

I can provide other diagnostic information as requested.
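
If it helps, this is the sort of thing I can collect right away (standard Open MPI and system tools; the last line just re-runs the failing case with more launch-time verbosity):

[code]
# Open MPI build and version details
which mpirun && mpirun --version
ompi_info | head -n 30

# Libraries the launcher itself resolves
ldd $(which mpirun)

# Re-run the failing case with extra daemon/launch debugging
mpirun --debug-daemons --mca plm_base_verbose 5 -np 2 geminimpi
[/code]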

--
Jon Stergiou
-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640
Sent: Monday, April 11, 2011 9:53
To: users_at_[hidden]
Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
I am running OpenMPI 1.4.2 under RHEL 5.5.  After install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs. 
Then I tried running two different MPI programs, "geminimpi" and "salinas".  Both run correctly with "mpirun -np 1 $prog".  However, both hang indefinitely when I use anything other than "-np 1".  
Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following, which looks good and is what I would expect:
[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
[XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
Fluid Proc Ready: ID, FluidMaster,LagMaster =     0    0    1
 Checking license for Gemini
 Checking license for Linux OS
 Checking internal license list
 License valid
 
 GEMINI Startup
 Gemini +++ Version 5.1.00  20110501 +++    
 
 +++++ ERROR MESSAGE +++++
 FILE MISSING (Input): name = gemini.inp
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6559 on
node XXX_TUX01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
[/code]
With "mpirun --debug-daemons -np 2 geminimpi", it hangs like this: (hangs indefinitely)
[code]
[xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
[/code]
I cloned my entire installation to a number of other machines to test.  On all the other workstations, everything behaves correctly and various regression suites return good results. 
Any ideas? 
--
Jon Stergiou
Engineer
NSWC Carderock
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users

