On Jan 25, 2010, at 11:58 AM, Mathieu Gontier wrote:
I built OpenMPI-1.4.1 without openib support with the following configuration options:
./configure --prefix=/develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach --enable-static --enable-shared --enable-cxx-exceptions --enable-mpi-f77 --disable-mpi-f90 --enable-mpi-cxx --disable-mpi-cxx-seek --enable-dist --enable-mpi-profile --enable-binaries --enable-mpi-threads --enable-memchecker --disable-debug --with-pic --with-threads --with-sge
Note that you should not use --enable-dist. --enable-dist is used by the OMPI maintainers ONLY when generating official downloadable tarballs. It is *NOT* guaranteed to make sane / correct builds for general purpose runs. Here's what ./configure --help says about --enable-dist:
--enable-dist guarantee that that the "dist" make target will be
functional, although may not guarantee that any
other make target will be functional.
Specifically: --enable-dist allows some configure tests to "pass" even though they shouldn't. For example, I don't have MX installed on my systems. But with --enable-dist, the MX tests in OMPI's configure script will "pass" just enough so that I can "make dist" to generate a tarball and still include all the MX plugin source code.
On my cluster, I ran a small test (a broadcast of a 100-integer array) on 12 processes balanced across 3 nodes, but I asked it to use openib. It works, with the following messages:
mpirun -np 12 -hostfile /tmp/72936.1.64.q/machines --mca btl openib,sm,self /home/numeca/tmp/gontier/bcast/exe_ompi_cluster -nloop 2 -nbuff 100
Are your PATH and LD_LIBRARY_PATH set correctly such that you'll find the "right" executables and libraries (i.e., the ones that you just built/installed in /develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach)? I.e., is it possible that you're finding some other OMPI install that has OpenFabrics support?
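A quick way to check this is sketched below. The prefix is the one from your configure line; the check_tool helper is just a name I made up for illustration, not part of Open MPI.

```shell
# Prefix copied from the configure line above; adjust to your install.
PREFIX=/develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach

# check_tool (hypothetical helper): report whether a command found in PATH
# lives under the expected prefix.
check_tool() {
    tool_path=$(command -v "$1" 2>/dev/null) || tool_path=""
    case "$tool_path" in
        "$2"/*) echo "$1: OK ($tool_path)" ;;
        "")     echo "$1: not found in PATH" ;;
        *)      echo "$1: found elsewhere ($tool_path)" ;;
    esac
}

check_tool mpirun "$PREFIX"
check_tool ompi_info "$PREFIX"
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
```

If either tool is reported as "found elsewhere", you are picking up a different Open MPI installation than the one you just built.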
Further, did you ever previously install Open MPI into that prefix and include OpenFabrics support? I ask because OMPI's OpenFabrics support is in the form of a plugin -- if you simply installed another copy of OMPI into the same prefix without uninstalling first, the OpenFabrics plugin could still have been left in the tree, and therefore used at run time.
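To look for a stale plugin, something like the following may help. It assumes the default layout, where plugins are installed under $prefix/lib/openmpi; list_openib_plugins is a hypothetical helper name.

```shell
# Prefix copied from the configure line above; adjust to your install.
PREFIX=/develop/libs/OpenMPI/openmpi-1.4.1/LINUX_GCC_4_1_tcp_mach

# list_openib_plugins (hypothetical helper): list any openib BTL plugin
# files in the given plugin directory, or say that there are none.
list_openib_plugins() {
    ls "$1"/mca_btl_openib* 2>/dev/null || echo "no openib plugin files in $1"
}

# Assumption: plugins live in $PREFIX/lib/openmpi (the default location).
list_openib_plugins "$PREFIX/lib/openmpi"
```

If files show up there even though the current build did not produce them, removing the whole prefix and re-running "make install" is the safest way to get a clean tree.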
Finally, note that you didn't tell Open MPI to *NOT* build OpenFabrics support. In this case, OMPI's configure script looks for OpenFabrics support, and if it finds it, builds it. But if it doesn't find OpenFabrics support (and you didn't specifically ask for it), it just skips it and keeps going. You might want to look through the output of OMPI's configure and see if it found OpenFabrics support and therefore decided to build it.
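As a rough sketch of how to check, assuming you saved configure's output to a file (configure.out is a hypothetical name, and the exact wording of the configure message may differ between versions):

```shell
# Assumption: configure was run as
#   ./configure ... 2>&1 | tee configure.out
# openib_decision (hypothetical helper) prints the line where configure
# reports whether the openib BTL component can be built.
openib_decision() {
    grep "MCA component btl:openib can compile" "$1" \
        || echo "no openib decision found in $1"
}

if [ -f configure.out ]; then
    openib_decision configure.out
fi

# To skip the component explicitly, configure's --enable-mca-no-build option
# takes a list of components, e.g. --enable-mca-no-build=btl-openib.
# Check ./configure --help for your version before relying on it.
```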
I finally run ompi_info:
./ompi_info | grep openib
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)
Openib seems to be supported. That is weird because I did not ask for it...
Yep; see above.
So, assuming this build of OpenMPI does not support openib, what happened? Was tcp selected? How can I check which device has been used (or force an explicit message)?
Unfortunately, OMPI currently lacks a good message indicating which device is used at run-time (because it's actually a surprisingly complex issue, since OMPI chooses a communication device based on which peer it's talking to, among other reasons). We hope to have a good message sometime in the OMPI 1.5 series.
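In the meantime, one workaround is to restrict the list of BTLs so that only the transport you want can be chosen: if the job runs with "tcp,sm,self", then openib cannot have been used. Adding btl_base_verbose makes the BTL framework print extra detail while it selects components, though the output format varies by version. The sketch below only prints the command line rather than running it; mpirun_cmd is a hypothetical helper, and the arguments are the ones from your mpirun line.

```shell
# mpirun_cmd (hypothetical helper): print an mpirun command line that is
# restricted to the given BTL list and asks for verbose BTL output.
mpirun_cmd() {
    btls=$1
    shift
    echo "mpirun --mca btl $btls --mca btl_base_verbose 30 $*"
}

# Force TCP (plus shared memory and self); openib cannot be selected.
mpirun_cmd tcp,sm,self -np 12 -hostfile /tmp/72936.1.64.q/machines \
    /home/numeca/tmp/gontier/bcast/exe_ompi_cluster -nloop 2 -nbuff 100
```

Using "^openib" as the BTL list instead excludes only openib and lets OMPI pick among everything else.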
By the way, what is the meaning of this message in my case?
Do you mean this message?
WARNING: There was an error initializing an OpenFabrics device.
Local host: node005
Local device: mthca0
If so, it means that Open MPI was unable to initialize the InfiniBand HCA known as "mthca0" on the server known as node005.
The RLIMIT messages are likely symptoms of the issue; you likely need to set your registered memory limits to "unlimited". The OpenFabrics section of the OMPI FAQ has entries on registered memory limits that explain how to do this.
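As a first check, you can look at the locked-memory limit that a shell on the compute node actually gets. This is only a sketch: describe_memlock is a hypothetical helper, and the limit seen by MPI processes started through SGE may differ from the one in an interactive login.

```shell
# describe_memlock (hypothetical helper): interpret a value as printed
# by "ulimit -l".
describe_memlock() {
    if [ "$1" = "unlimited" ]; then
        echo "memlock limit: unlimited"
    else
        echo "memlock limit: $1 (likely too low for registered memory)"
    fi
}

describe_memlock "$(ulimit -l)"

# A common way to raise the limit system-wide (requires root) is to add
# lines like these to /etc/security/limits.conf on every node:
#   * soft memlock unlimited
#   * hard memlock unlimited
```

Run it on node005 both from a login shell and from inside an SGE job, since the two can report different limits.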
By the way, a different question: must OpenMPI be compiled with gcc-4.1 or later, or can gcc-3.4 (for example) be used?
gcc 3.4 should be fine.