
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-10-04 09:45:20


It looks to me like your remote nodes aren't finding the orted executable. I suspect the problem is that you need to forward PATH and LD_LIBRARY_PATH to the remote nodes. Use the mpirun -x option to do so.
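
For example, reusing the hostname command from the job script below (mpirun's -x flag exports the named environment variable to the launched processes):

  mpirun -x PATH -x LD_LIBRARY_PATH hostname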

On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
>
> Firstly, hello to the mailing list for the first time! Secondly, sorry for the nondescript subject line, but I couldn't really think how to be more specific!
>
> Anyway, I am currently having a problem getting Open MPI to work within my installation of SGE 6.2u5. I compiled Open MPI 1.4.2 from source and installed it under /usr/local/packages/openmpi-1.4.2. Software on my system is managed by the Modules framework, which adds the bin and lib directories to PATH and LD_LIBRARY_PATH respectively when a user is connected to an execution node. I configured a parallel environment in which Open MPI is to be used:
>
> pe_name mpi
> slots 16
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $round_robin
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
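> For reference, a PE definition like this is typically registered and attached to a queue with qconf; the file name and queue name below are illustrative:
>
> qconf -Ap mpi_pe.conf                  # add the PE defined in mpi_pe.conf
> qconf -aattr queue pe_list mpi all.q   # make the mpi PE available in queue all.q
>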
> I then tried a simple job submission script:
>
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
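>
> The script is then submitted requesting the mpi PE, along the lines of (script name and slot count illustrative):
>
> qsub -pe mpi 16 job.sh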
>
> If the job runs within a single execution host (8 slots per host), then all is fine. However, if it is scheduled across several nodes, I get an error:
>
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> I'm at a loss as to how to start debugging this, and I don't seem to be getting anything useful from the mpirun '-d' and '-v' switches. The SGE logs don't show anything. Can anyone suggest either what is wrong, or how I might go about getting more information?
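>
> (For reference, the debug invocation was along the lines of
>
> mpirun -d -v hostname
>
> in the job script above.)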
>
> Many thanks,
>
>
> Chris
>
>
>
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users