Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-04-12 10:59:15


Let's simplify the issue, since we have no idea what your codes are doing.

Can you run two copies of hostname, for example?

What about multiple copies of an MPI version of "hello"? See the examples directory in the OMPI tarball.
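
One way to script these two checks is to run each command under a timeout, so that a hang is reported instead of waited on. The following is only a minimal sketch in Python: the helper names `run_with_timeout` and `check_mpirun`, the 30-second limit, and the `./hello_c` binary name (assumed here to be built from the examples directory) are illustrative assumptions, not part of Open MPI.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30):
    """Run cmd and return (status, output); status is 'ok', 'failed' or 'hung'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The launched process is killed on timeout; any ranks it started
        # may still need to be cleaned up by hand.
        return "hung", ""
    except FileNotFoundError:
        return "failed", "command not found: " + cmd[0]
    status = "ok" if proc.returncode == 0 else "failed"
    return status, proc.stdout


def check_mpirun(np=2, programs=("hostname", "./hello_c")):
    """Try a non-MPI program first, then an MPI 'hello' (assumed binary name)."""
    for prog in programs:
        cmd = ["mpirun", "-np", str(np), prog]
        status, out = run_with_timeout(cmd)
        print(" ".join(cmd), "->", status)
        if out:
            print(out, end="")
```

Calling `check_mpirun()` on the affected workstation would show whether the hang already appears with `hostname` (no MPI calls at all) or only once an MPI program is started.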

On Apr 12, 2011, at 8:43 AM, Stergiou, Jonathan C CIV NSWCCD West Bethesda, 6640 wrote:

> Apologies for not clarifying. The first behavior below is expected; I am just checking that Gemini will start up and look for its input file. When Gemini+OpenMPI is working correctly, that is what I expect to see.
>
> When Gemini+OpenMPI is not working correctly (current behavior), I see the second behavior. When running with "-np 1", Gemini will start up and look for its input file. When running with "-np 2" (or anything more than 1), Gemini never starts up. Instead, the code simply hangs indefinitely. I showed Gemini as an example. I don't believe the issue is Gemini-related, as I've reproduced the same "hanging" behavior with two other MPI codes (Salinas, ParaDyn).
>
> The same codebase runs correctly on many other workstations. It was transferred from my machine (the build machine) to colleagues' machines via "rsync -vrlpu /opt/sierra/ targetmachine:/opt/sierra".
>
> I tried the following fixes, but still have problems:
>
> -Copy salinas (or geminimpi) locally, run "mpirun -np 2 ./salinas"
> Tried running locally, both interactively and through queueing system. No difference in behavior.
>
> -Compare "ldd salinas" and "ldd gemini" with functioning examples (examples from coworkers' workstations).
> Compared "ldd salinas" output (and "ldd geminimpi") with results from other workstations. Comparisons look fine.
>
> -Create new user account with clean profile on my workstation. Maybe it is an environment problem.
> Created new user account and sourced "/opt/sierra/install/sierra_init.sh" to set up path. No difference in behavior.
>
> -Compare /etc/profile and /etc/bashrc with "functioning" examples.
> I compared my /etc/profile and /etc/bashrc with my colleagues' copies. The comparisons don't raise any flags.
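
The "ldd" comparison in the list above can be made less error-prone by ignoring load addresses, which differ from run to run, and comparing only which path each library resolves to. This is a rough sketch in Python, assuming the usual "name => path (0xaddress)" layout of ldd output; the function names and the sample paths in the usage note are made up for illustration.

```python
import re

# Trailing load address such as "(0x00002b0000000000)".
_ADDR = re.compile(r"\s*\(0x[0-9a-fA-F]+\)\s*$")


def parse_ldd(text):
    """Map each library name in ldd output to the path it resolved to."""
    libs = {}
    for line in text.splitlines():
        line = _ADDR.sub("", line.strip())
        if not line:
            continue
        if "=>" in line:
            name, _, path = line.partition("=>")
            libs[name.strip()] = path.strip()  # path may be "" or "not found"
        else:
            libs[line] = line  # entries listed by absolute path only
    return libs


def diff_ldd(text_a, text_b):
    """Return {library: (path_a, path_b)} for entries that differ."""
    a, b = parse_ldd(text_a), parse_ldd(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Given the saved output of "ldd salinas" from two machines as strings, `diff_ldd(mine, theirs)` returns an empty dict when every library resolves to the same path, and otherwise lists entries such as a hypothetical `{"libmpi.so.0": ("/opt/a/libmpi.so.0", "/usr/lib64/libmpi.so.0")}`.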
>
> I can provide other diagnostic-type information as requested.
>
> --
> Jon Stergiou
>
>
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Stergiou, Jonathan C CIV NSWCCD West Bethesda,6640
> Sent: Monday, April 11, 2011 9:53
> To: users_at_[hidden]
> Subject: [OMPI users] OpenMPI 1.4.2 Hangs When Using More Than 1 Proc
>
> I am running OpenMPI 1.4.2 under RHEL 5.5. After install, I tested with "mpirun -np 4 date"; the command returned four "date" outputs.
>
> Then I tried running two different MPI programs, "geminimpi" and "salinas". Both run correctly with "mpirun -np 1 $prog". However, both hang indefinitely when I use anything other than "-np 1".
>
> Next, I ran "mpirun --debug-daemons -np 1 geminimpi" and got the following: (this looks good, and is what I would expect)
>
> [code]
> [xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 1 geminimpi
> [XXX_TUX01:06558] [[15027,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received add_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_recv: received sync+nidmap from local proc [[15027,1],0]
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received message_local_procs
> Fluid Proc Ready: ID, FluidMaster,LagMaster = 0 0 1
> Checking license for Gemini
> Checking license for Linux OS
> Checking internal license list
> License valid
>
> GEMINI Startup
> Gemini +++ Version 5.1.00 20110501 +++
>
> +++++ ERROR MESSAGE +++++
> FILE MISSING (Input): name = gemini.inp
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received waitpid_fired cmd
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received iof_complete cmd
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 6559 on
> node XXX_TUX01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [XXX_TUX01:06558] [[15027,0],0] orted_cmd: received exit
> [/code]
>
> With "mpirun --debug-daemons -np 2 geminimpi", it hangs like this: (hangs indefinitely)
>
> [code]
> [xxx_at_XXX_TUX01 ~]$ mpirun --debug-daemons -np 2 geminimpi
> [XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
> [XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
> [XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
> [XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
> [XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
> [XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
> [/code]
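
To see where the daemon output stops, the two "--debug-daemons" logs above can be reduced to their sequence of "received ..." events. The sketch below, in Python, only pattern-matches the log lines as they appear in this message; the function name `orted_events` is made up, and the sketch does not interpret what the events mean inside Open MPI.

```python
import re

# Matches e.g. "orted_cmd: received collective data cmd".
_EVENT = re.compile(r"orted_(?:cmd|recv): received (.+)$")

# The "-np 2" log quoted above, copied verbatim.
HUNG_LOG = """\
[XXX_TUX01:06570] [[14983,0],0] node[0].name XXX_TUX01 daemon 0 arch ffc91200
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received add_local_procs
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],1]
[XXX_TUX01:06570] [[14983,0],0] orted_recv: received sync+nidmap from local proc [[14983,1],0]
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received collective data cmd
[XXX_TUX01:06570] [[14983,0],0] orted_cmd: received message_local_procs
"""


def orted_events(log_text):
    """Return the ordered list of event names the daemon reported receiving."""
    events = []
    for line in log_text.splitlines():
        match = _EVENT.search(line.strip())
        if not match:
            continue
        event = match.group(1).split(" from ")[0]  # drop "from local proc ..."
        if event.endswith(" cmd"):
            event = event[: -len(" cmd")]
        events.append(event)
    return events


if __name__ == "__main__":
    events = orted_events(HUNG_LOG)
    print(events)
    print("last event before the hang:", events[-1])
```

For the "-np 2" log, the last reported event is `message_local_procs`; in the "-np 1" log the daemon goes on to report a second `collective data` and `message_local_procs` pair before the application output appears.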
>
>
> I cloned my entire installation to a number of other machines to test. On all the other workstations, everything behaves correctly and various regression suites return good results.
>
> Any ideas?
>
> --
> Jon Stergiou
> Engineer
> NSWC Carderock
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users