Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] My MPI build is broke, don't know why/how
From: Ralph Castain (rhc_at_[hidden])
Date: 2012-08-23 19:29:40


The 1.6 code always expects to find the default hostfile, even if the file is empty. We install it by default, so I don't know why yours is missing. Future releases will simply ignore the file if it isn't found.

You have two options:

1. Create that file (/opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile, per your error message) and leave it empty.

2. Work around it by adding --default-hostfile none to your mpirun command line, or by setting OMPI_MCA_orte_default_hostfile=none in your environment. To do this for everyone on the system, add "orte_default_hostfile=none" to your default MCA param file.
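
As a rough sketch, the two options could be scripted as below. The function names are hypothetical helpers, not part of Open MPI. The sketch assumes the default MCA param file is etc/openmpi-mca-params.conf under the install prefix, and ./my_app is a placeholder program name.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Option 1: create the (empty) default hostfile under the install prefix.
# touch leaves the contents of an existing file alone.
create_empty_default_hostfile() {
    prefix=$1
    mkdir -p "$prefix/etc" && touch "$prefix/etc/openmpi-default-hostfile"
}

# Option 2, system-wide: add orte_default_hostfile=none to the default MCA
# param file (assumed here to be <prefix>/etc/openmpi-mca-params.conf),
# unless a setting for that parameter is already present.
disable_default_hostfile_systemwide() {
    conf=$1/etc/openmpi-mca-params.conf
    mkdir -p "$1/etc"
    grep -q '^[[:space:]]*orte_default_hostfile' "$conf" 2>/dev/null ||
        echo 'orte_default_hostfile=none' >> "$conf"
}

# Option 2, per run or per user:
#   mpirun --default-hostfile none ./my_app
#   export OMPI_MCA_orte_default_hostfile=none
```

With the prefix from the error message, that would be something like `create_empty_default_hostfile /opt/openmpi-gcc/1.6`, run with enough privileges to write there, and on every node if the install is not on a shared filesystem.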

HTH
Ralph

On Aug 23, 2012, at 4:03 PM, Jim Kusznir <jkusznir_at_[hidden]> wrote:

> Hi all:
>
> I recently rebuilt my cluster from Rocks 5 to Rocks 6 (which is based
> on CentOS 6.2) using the official spec file and the same build options
> as before. It all built successfully and everything appeared fine. That
> is, until someone tried to use it. This is built with Torque integration,
> and it's run through Torque. When a user's job runs, this ends up in the
> error file and the program does not run successfully:
>
> --------------------------------------------------------------------------
> Open RTE was unable to open the hostfile:
> /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile
> Check to make sure the path and filename are correct.
> --------------------------------------------------------------------------
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/rmaps_base_support_fns.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file rmaps_rr.c at line 82
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/rmaps_base_map_job.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/plm_base_launch_support.c at line 105
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file plm_tm_module.c at line 194
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> This has been confirmed with several different node assignments. Any
> ideas on cause or fixes?
>
> I built it with this command:
> rpmbuild -bb --define 'install_in_opt 1' --define 'install_modulefile
> 1' --define 'modules_rpm_name environment-modules' --define
> 'build_all_in_one_rpm 0' --define 'configure_options
> --with-tm=/opt/torque' --define '_name openmpi-gcc' --define 'makeopts
> -J8' openmpi.spec
>
> (and the PGI version was built with:
> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb --define
> 'install_in_opt 1' --define 'install_modulefile 1' --define
> 'modules_rpm_name environment-modules' --define 'build_all_in_one_rpm
> 0' --define 'configure_options --with-tm=/opt/torque' --define '_name
> openmpi-pgi' --define 'use_default_rpm_opt_flags 0' openmpi.spec
> )
>
> --Jim