Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [ompi-1.4.1] compiling without openib, running with openib + ompi141 and gcc3
From: Mathieu Gontier (mg.mailing-list_at_[hidden])
Date: 2010-01-26 04:22:31


1/ I rebuilt without --enable-dist (safer indeed) and with an explicit --without-openib/--with-openib: the behavior is better. Great.
2/ Yes, my PATH and LD_LIBRARY_PATH are set correctly.
3/ There certainly were previous installations of OpenMPI on this machine, but not in the same directory; before rebuilding, I properly uninstalled the previous installations (thank you for the tip).
4/ Is there a way to list the plugins in OFED?
5/ Yes, good messages about the device would be welcome, but with 1/ it is really better now.
6/ The message is much clearer when explained like that (thanks).
7/ I built both my small test and OpenMPI-1.4.1 directly on my cluster with gcc-3.4 and I still get this error. Do you have any idea where the problem could come from?

==== begin ===================================================================
--mca btl    tcp,sm,self
        => /home/numeca/tmp/gontier/openmpi-1.4.1/LINUX_GCC_3_4_openib_cluster/lib/ (0x00002b1e280f8000)
        => /home/numeca/tmp/gontier/openmpi-1.4.1/LINUX_GCC_3_4_openib_cluster/lib/ (0x00002b1e28429000)
        => /home/numeca/tmp/gontier/openmpi-1.4.1/LINUX_GCC_3_4_openib_cluster/lib/ (0x00002b1e285ae000)
        => /cvos/shared/apps/ofed/1.2.5/lib64/ (0x00002b1e28728000)
        => /usr/lib64/ (0x000000357de00000)
        => /lib64/ (0x00002b1e2883b000)
        => /lib64/ (0x000000357e800000)
        => /lib64/ (0x000000357ec00000)
        => /lib64/tls/ (0x000000357e000000)
        => /lib64/tls/ (0x000000357f200000)
        => /lib64/tls/ (0x000000357db00000)
        /lib64/ (0x000000357d900000)
mpirun -np 12 -hostfile /tmp/72962.1.64.q/machines --mca btl    tcp,sm,self /home/numeca/tmp/gontier/bcast/exe_ompi_cluster -nloop 2 -nbuff 100
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_memchecker_base_select failed
  --> Returned value -13 instead of OPAL_SUCCESS
[node015:12413] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file /develop/libs/OpenMPI/_compile/openmpi-1.4.1/orte/runtime/orte_init.c at line 77
[node015:12413] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file /develop/libs/OpenMPI/_compile/openmpi-1.4.1/orte/tools/orterun/orterun.c at line 541
==== end ======================================================================

Many thanks, Jeff: the previous answers were really useful.

Jeff Squyres wrote:
On Jan 25, 2010, at 11:58 AM, Mathieu Gontier wrote:

I built OpenMPI-1.4.1 without openib support with the following configuration options:

./configure --prefix=/develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach --enable-static --enable-shared --enable-cxx-exceptions --enable-mpi-f77 --disable-mpi-f90 --enable-mpi-cxx --disable-mpi-cxx-seek --enable-dist --enable-mpi-profile --enable-binaries --enable-mpi-threads --enable-memchecker --disable-debug --with-pic --with-threads   --with-sge

Note that you should not use --enable-dist.  --enable-dist is used by the OMPI maintainers ONLY when generating official downloadable tarballs.  It is *NOT* guaranteed to make sane / correct builds for general purpose runs.  Here's what ./configure --help says about --enable-dist:

  --enable-dist           guarantee that that the "dist" make target will be
                          functional, although may not guarantee that any
                          other make target will be functional.

Specifically: --enable-dist allows some configure tests to "pass" even though they shouldn't.  For example, I don't have MX installed on my systems.  But with --enable-dist, the MX tests in OMPI's configure script will "pass" just enough so that I can "make dist" to generate a tarball and still include all the MX plugin source code.  

On my cluster, I ran a small test (a broadcast of a 100-integer array) on 12 processes balanced across 3 nodes, but I asked it to use openib. It works, with the following messages:

mpirun -np 12 -hostfile /tmp/72936.1.64.q/machines --mca btl openib,sm,self /home/numeca/tmp/gontier/bcast/exe_ompi_cluster -nloop 2 -nbuff 100

Are your PATH and LD_LIBRARY_PATH set correctly such that you'll find the "right" ones (i.e., the ones that you just built/installed in /develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach)?  I.e., is it possible that you're finding some other OMPI install that has OpenFabrics support?
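One way to answer this is to check which install directory wins each search. The sketch below is only an illustration: first_in_path is a hypothetical helper, not an Open MPI tool, and libmpi.so is assumed to be the library's file name in the install's lib directory.

```shell
# Hypothetical helper: print the first directory of a colon-separated search
# path (PATH, LD_LIBRARY_PATH, ...) that contains a file with the given name.
first_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        if [ -f "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Both results should point into the prefix you just installed.
first_in_path mpirun "$PATH" || echo "mpirun not found in PATH"
first_in_path libmpi.so "${LD_LIBRARY_PATH:-}" || echo "libmpi.so not found in LD_LIBRARY_PATH"
```

If either directory belongs to a different Open MPI install, that install is the one being used at run time.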

Further, did you ever previously install Open MPI into that prefix and include OpenFabrics support?  I ask because OMPI's OpenFabrics support is in the form of a plugin -- if you simply installed another copy of OMPI into the same prefix without uninstalling first, the OpenFabrics plugin could still have been left in the tree, and therefore used at run time.
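A leftover plugin can be looked for directly on disk. This sketch assumes the usual layout in which components are installed as mca_<framework>_<name> files under <prefix>/lib/openmpi; that layout and the helper name are assumptions, so adjust them to your install.

```shell
# Hypothetical helper: list openib BTL plugin files under an install prefix.
# Returns 0 if at least one is found, 1 otherwise.
list_openib_plugins() {
    found=1
    for f in "$1"/lib/openmpi/mca_btl_openib.*; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
            found=0
        fi
    done
    return $found
}

# Prefix taken from the configure line quoted in this thread.
list_openib_plugins /develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach \
    || echo "no openib plugin found under that prefix"
```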

Finally, note that you didn't tell Open MPI to *NOT* build OpenFabrics support.  In this case, OMPI's configure script looks for OpenFabrics support, and if it finds it, builds it.  But if it doesn't find OpenFabrics support (and you didn't specifically ask for it), it just skips it and keeps going.  You might want to look through the output of OMPI's configure and see if it found OpenFabrics support and therefore decided to build it.
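The two configure points above (drop --enable-dist, state the OpenFabrics choice explicitly) can be checked before building. A minimal sketch; lint_configure_args is a hypothetical helper written for this illustration and is not part of Open MPI:

```shell
# Hypothetical helper: flag configure arguments that the advice above warns about.
# Returns 0 when nothing is flagged, 1 otherwise.
lint_configure_args() {
    rc=0
    case " $* " in
        *" --enable-dist "*)
            echo "warning: --enable-dist is meant only for generating release tarballs"
            rc=1 ;;
    esac
    case " $* " in
        *" --with-openib "*|*" --with-openib="*|*" --without-openib "*) ;;
        *)
            echo "note: no explicit --with-openib/--without-openib; configure will autodetect OpenFabrics"
            rc=1 ;;
    esac
    return $rc
}

# A TCP-only build stated explicitly (prefix taken from this thread):
lint_configure_args --prefix=/develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach \
    --without-openib --with-sge && echo "arguments look fine"
```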

Finally, I ran ompi_info:

./ompi_info | grep openib
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)

Openib seems to be supported. That is weird because I did not ask for it...

Yep; see above.
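To see every BTL component an install provides, the same ompi_info output can be filtered by framework. A minimal sketch, assuming the "MCA btl: name (...)" line format shown in the output above; list_btl_components is a hypothetical helper:

```shell
# Hypothetical helper: read ompi_info output on stdin and print the name of
# each BTL component, one per line.
list_btl_components() {
    sed -n 's/^ *MCA btl: *\([^ ]*\) .*/\1/p'
}

# Only runs where ompi_info is actually on the PATH.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | list_btl_components
fi
```

If openib appears in the list, that install was built with the OpenFabrics plugin, whether or not it was requested at configure time.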

So, assuming this build of OpenMPI does not support openib, what happened? Was tcp selected? How can I check which device was used (or force an explicit message)?

Unfortunately, OMPI currently lacks a good message indicating which device is used at run-time (because it's actually a surprisingly complex issue, since OMPI chooses a communication device based on which peer it's talking to, among other reasons).  We hope to have a good message sometime in the OMPI 1.5 series.
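In the meantime, two MCA parameters can help narrow it down: btl restricts which components may be used (as in the mpirun lines in this thread), and btl_base_verbose raises the BTL framework's log verbosity. The sketch below wraps both; the wrapper name, the DRY_RUN variable and the verbosity value 30 are assumptions for illustration, and the format of the verbose output is not something to rely on.

```shell
# Hypothetical wrapper: run mpirun with an explicit BTL list and verbose BTL
# logging. With DRY_RUN set, only print the command that would be run.
run_with_btl() {
    btl_list=$1
    shift
    set -- mpirun --mca btl "$btl_list" --mca btl_base_verbose 30 "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Example (not executed here):
#   run_with_btl tcp,sm,self -np 12 -hostfile machines ./exe_ompi_cluster
```

Restricting the list to tcp,sm,self rules openib out entirely, which is often the simplest way to be sure which transport carried the traffic.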

By the way, what is the meaning of this message in my case?

Do you mean this message?

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node005
  Local device: mthca0

If so, it means that Open MPI was unable to initialize the InfiniBand HCA known as "mthca0" on the server known as node005.  

The RLIMIT messages are likely symptoms of the issue; you likely need to set your registered memory limits to "unlimited".  See the questions about registered memory limits in the OpenFabrics section of the OMPI FAQ for instructions.
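A quick way to see the limit in question is the shell's ulimit -l, which reports the maximum locked memory. A minimal sketch (the helper name is hypothetical); it is most meaningful when run on the compute nodes in the same way your MPI processes are started, since limits can differ between login and batch environments:

```shell
# Hypothetical helper: succeed only if the given limit value is "unlimited".
memlock_is_unlimited() {
    [ "$1" = "unlimited" ]
}

limit=$(ulimit -l)
if memlock_is_unlimited "$limit"; then
    echo "locked-memory limit: unlimited"
else
    echo "locked-memory limit is $limit; OpenFabrics devices may fail to initialize"
fi
```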

By the way, a different question: must OpenMPI be compiled with gcc-4.1 or later, or can gcc-3.4 (for example) be used?

gcc 3.4 should be fine.