Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI 1.3RC2 job startup issue
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-12-22 18:17:36


Your backend nodes are mistakenly picking up the OMPI 1.2 orted binary
instead of the 1.3 orted. The two are not compatible.
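
A quick way to confirm this is to see which orted a non-interactive shell
on a compute node actually finds. A rough check, assuming you can ssh from
the head node to one of the nodes from the failed run (compute-2-6 here),
would be:

   ssh compute-2-6 'which orted; ompi_info | grep "Open MPI:"'

If the path or the version reported there belongs to the 1.2 install, that
is the mismatch.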

Check LD_LIBRARY_PATH and PATH on the backend nodes and make sure they
point at the 1.3 installation. There are other ways of pointing to the
correct installation as well - check the OMPI FAQ pages for alternatives
if this doesn't work for you.
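
For example (a rough sketch - the install location /opt/openmpi-1.3 below
is just a placeholder for wherever your 1.3 build actually lives), you
could export the paths in the startup file that non-interactive shells on
the compute nodes read:

   export PATH=/opt/openmpi-1.3/bin:$PATH
   export LD_LIBRARY_PATH=/opt/openmpi-1.3/lib:$LD_LIBRARY_PATH

or have mpirun prepend them on the remote nodes for you:

   mpirun --prefix /opt/openmpi-1.3 -np 1024 program

Configuring the 1.3 build with --enable-mpirun-prefix-by-default gives you
the latter behavior without typing --prefix every time.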

Ralph

On Dec 22, 2008, at 2:58 PM, Ray Muno wrote:

> We have been happily running under OpenMPI 1.2 on our cluster until
> recently. It has 2200 processors (8-way Opteron nodes), QLogic IB
> connected.
>
> We have had issues starting larger jobs (600+ processors). There
> seemed to be some indication that OpenMPI 1.3 might solve our problems.
>
> It built and installed with no problems, and users can compile programs.
>
> When they tried to run, they got the output below. Are we
> missing something obvious?
>
> This is a Rocks cluster with jobs scheduled through SGE.
>
> =====================================================
> $ mpirun -np 1024 program
>
> [compute-2-6.local:32580] Error: unknown option "--daemonize"
> Usage: orted [OPTION]...
> Start an Open RTE Daemon
>
> --bootproxy <arg0>       Run as boot proxy for <job-id>
> -d|--debug               Debug the OpenRTE
> -d|--spin                Have the orted spin until we can connect a debugger to it
> --debug-daemons          Enable debugging of OpenRTE daemons
> --debug-daemons-file     Enable debugging of OpenRTE daemons, storing output in files
> --gprreplica <arg0>      Registry contact information.
> -h|--help                This help message
> --mpi-call-yield <arg0>  Have MPI (or similar) applications call yield when idle
> --name <arg0>            Set the orte process name
> --no-daemonize           Don't daemonize into the background
> --nodename <arg0>        Node name as specified by host/resource description.
> --ns-nds <arg0>          set sds/nds component to use for daemon (normally not needed)
> --nsreplica <arg0>       Name service contact information.
> --num_procs <arg0>       Set the number of process in this job
> --persistent             Remain alive after the application process completes
> --report-uri <arg0>      Report this process' uri on indicated pipe
> --scope <arg0>           Set restrictions on who can connect to this universe
> --seed                   Host replicas for the core universe services
> --set-sid                Direct the orted to separate from the current session
> --tmpdir <arg0>          Set the root for the session directory tree
> --universe <arg0>        Set the universe name as username_at_hostname:universe_name for this application
> --vpid_start <arg0>      Set the starting vpid for this job
> --------------------------------------------------------------------------
> A daemon (pid 4151) died unexpectedly with status 251 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> compute-5-15.local - daemon did not report back when launched
> compute-5-35.local - daemon did not report back when launched
> compute-4-8.local - daemon did not report back when launched
> compute-7-2.local - daemon did not report back when launched
> compute-2-6.local - daemon did not report back when launched
> compute-6-28.local - daemon did not report back when launched
> compute-6-35.local - daemon did not report back when launched
> compute-6-25.local
> compute-6-26.local
> compute-2-19.local - daemon did not report back when launched
> compute-6-37.local - daemon did not report back when launched
> compute-6-12.local - daemon did not report back when launched
> compute-2-36.local - daemon did not report back when launched
> compute-7-5.local - daemon did not report back when launched
> compute-7-23.local - daemon did not report back when launched
>
> ================================================
>
> --
>
> Ray Muno
> University of Minnesota
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users