Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error when using OpenMPI with SGE multiple hosts
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-10-04 09:45:20


It looks to me like your remote nodes aren't finding the orted executable. I suspect the problem is that you need to forward the PATH and LD_LIBRARY_PATH to the remote nodes. Use the mpirun -x option to do so.
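A sketch of what that might look like (the `hostname` target mirrors the job script quoted below; the exact variables to forward depend on what your Modules setup exports):

```shell
# Forward the launching shell's PATH and LD_LIBRARY_PATH to the remote
# nodes so the daemons there can locate orted and its shared libraries.
mpirun -x PATH -x LD_LIBRARY_PATH hostname

# Alternatively, tell every node where the Open MPI tree lives
# (prefix taken from the install location mentioned below):
mpirun --prefix /usr/local/packages/openmpi-1.4.2 hostname
```

Note that -x forwards the variable's value from the node where mpirun runs, which only helps if the remote nodes use the same filesystem layout.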

On Oct 4, 2010, at 5:08 AM, Chris Jewell wrote:

> Hi all,
>
> Firstly, hello to the mailing list for the first time! Secondly, sorry for the non-descript subject line, but I couldn't really think of how to be more specific!
>
> Anyway, I am currently having a problem getting OpenMPI to work within my installation of SGE 6.2u5. I compiled OpenMPI 1.4.2 from source and installed it under /usr/local/packages/openmpi-1.4.2. Software on my system is controlled by the Modules framework, which adds the bin and lib directories to PATH and LD_LIBRARY_PATH respectively when a user is connected to an execution node. I configured a parallel environment in which OpenMPI is to be used:
>
> pe_name mpi
> slots 16
> user_lists NONE
> xuser_lists NONE
> start_proc_args /bin/true
> stop_proc_args /bin/true
> allocation_rule $round_robin
> control_slaves TRUE
> job_is_first_task FALSE
> urgency_slots min
> accounting_summary FALSE
>
> I then tried a simple job submission script:
>
> #!/bin/bash
> #
> #$ -S /bin/bash
> . /etc/profile
> module add ompi gcc
> mpirun hostname
>
> If the parallel environment runs within one execution host (8 slots per host), then all is fine. However, if the job is scheduled across several nodes, I get an error:
>
> execv: No such file or directory
> execv: No such file or directory
> execv: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 1629) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
>
> I'm at a loss as to how to start debugging this, and I don't seem to be getting anything useful from the mpirun '-d' and '-v' switches. The SGE logs don't note anything. Can anyone suggest either what is wrong, or how I might get more information?
>
> Many thanks,
>
>
> Chris
>
>
>
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users