Open MPI User's Mailing List Archives

Subject: [OMPI users] mpi problems,
From: Nehemiah Dacres (dacresni_at_[hidden])
Date: 2011-03-30 16:24:29


I am trying to figure out why my jobs aren't getting distributed and need
some help. I have an install of Sun ClusterTools on Rocks Cluster 5.2
(essentially CentOS 4u2). This user's account has its home directory shared
via NFS. I am getting some strange errors. Here's an example run:

[jian_at_therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/mpirun -np 3 -hostfile list ./job2.sh
bash: /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 20362) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[jian_at_therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/
bin/ examples/ instrument/ man/
etc/ include/ lib/ share/
[jian_at_therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orte
orte-clean orted orte-iof orte-ps orterun
[jian_at_therock ~]$ /opt/SUNWhpc/HPC8.2.1c/sun/bin/orted
[therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[therock.slu.loc:20365] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 325
[jian_at_therock ~]$
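
Since orted is clearly present on the head node, one thing that may be worth checking is whether the same path exists on the other hosts named in the hostfile. Below is a rough sketch of such a check, not a definitive diagnosis. It assumes passwordless ssh to every host and an Open MPI style hostfile (one host per line, optional "slots=N", "#" comments). The helper names hosts_from_file and check_orted are made up for illustration and are not part of ClusterTools or Open MPI.

```shell
#!/bin/bash
# Hypothetical helpers, not part of Open MPI or Sun ClusterTools.
# The orted path is the one from the failing run above.
ORTED=/opt/SUNWhpc/HPC8.2.1c/sun/bin/orted

# Print the unique host names found in a hostfile.
# Strips '#' comments, skips blank lines, and keeps only the first field
# (so a trailing "slots=2" is dropped). Output is sorted.
hosts_from_file() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1" | sort -u
}

# For each host in the hostfile, report whether orted exists and is
# executable at the expected path on that host.
# Assumption: passwordless ssh works; BatchMode makes ssh fail instead of
# prompting for a password.
check_orted() {
    local host
    for host in $(hosts_from_file "$1"); do
        if ssh -o BatchMode=yes "$host" test -x "$ORTED" </dev/null; then
            echo "$host: ok"
        else
            echo "$host: missing or not executable (or ssh failed)"
        fi
    done
}

# Usage (with the hostfile from the run above):
#   check_orted list
```

A host reported as missing would be consistent with the "No such file or directory" message from bash, which is printed by the shell on the remote side rather than by mpirun itself.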

-- 
Nehemiah I. Dacres
System Administrator
Advanced Technology Group Saint Louis University