
Open MPI User's Mailing List Archives


Subject: [OMPI users] My MPI build is broke, don't know why/how
From: Jim Kusznir (jkusznir_at_[hidden])
Date: 2012-08-23 19:03:25


Hi all:

I recently rebuilt my cluster from Rocks 5 to Rocks 6 (which is based
on CentOS 6.2) using the official spec file and the same build options
as before. It all built successfully and everything appeared good --
that is, until someone tried to use it. The build has Torque
integration, and jobs are run through Torque. When a user's job runs,
the following ends up in the error file and the program does not run
successfully:

--------------------------------------------------------------------------
Open RTE was unable to open the hostfile:
    /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile
Check to make sure the path and filename are correct.
--------------------------------------------------------------------------
[compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
file base/rmaps_base_support_fns.c at line 88
[compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
file rmaps_rr.c at line 82
[compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
file base/rmaps_base_map_job.c at line 88
[compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
file base/plm_base_launch_support.c at line 105
[compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
file plm_tm_module.c at line 194
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------

I have confirmed this with several different node assignments. Any
ideas on the cause or a fix?
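A minimal check worth running first (the prefix below is simply taken
from the error message above; adjust if your install differs): see
whether the default hostfile Open RTE complains about actually exists
on the compute nodes.

```shell
# Check for the default hostfile named in the error; /opt/openmpi-gcc/1.6
# is the install prefix reported by Open RTE above.
HOSTFILE=/opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile
if [ -f "$HOSTFILE" ]; then
    echo "hostfile present: $HOSTFILE"
else
    echo "hostfile missing: $HOSTFILE"
fi
```

If it is missing on the compute nodes, one possible explanation (an
assumption, not confirmed by anything above) is that with
build_all_in_one_rpm 0 the install is split across several RPMs and not
all of them made it onto every node; installing the full set, or
creating the file at that path, may clear this particular error.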

I built it with this command (note: make's parallel-build flag is
lowercase, so the makeopts define should read -j8):

rpmbuild -bb \
    --define 'install_in_opt 1' \
    --define 'install_modulefile 1' \
    --define 'modules_rpm_name environment-modules' \
    --define 'build_all_in_one_rpm 0' \
    --define 'configure_options --with-tm=/opt/torque' \
    --define '_name openmpi-gcc' \
    --define 'makeopts -j8' \
    openmpi.spec

(and the PGI version was built with:

CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
    --define 'install_in_opt 1' \
    --define 'install_modulefile 1' \
    --define 'modules_rpm_name environment-modules' \
    --define 'build_all_in_one_rpm 0' \
    --define 'configure_options --with-tm=/opt/torque' \
    --define '_name openmpi-pgi' \
    --define 'use_default_rpm_opt_flags 0' \
    openmpi.spec
)
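Since the errors come out of the tm (Torque) launcher, it is also worth
confirming that --with-tm actually took effect in the rebuilt packages.
ompi_info lists every MCA component compiled into an Open MPI install,
so a sketch like the following (run on a node with the openmpi-gcc
module loaded) should show tm components under the plm and ras
frameworks if Torque support is really there:

```shell
# List the Torque/TM components compiled into this Open MPI install.
# Guarded so it degrades gracefully where ompi_info is not on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -w tm || echo "no tm components found"
else
    echo "ompi_info not on PATH (load the openmpi module first)"
fi
```

If no tm components show up, the configure_options define most likely
did not reach configure, and the launch failures above would follow
from Open MPI falling back to a launcher that cannot see the Torque
node assignment.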

--Jim