
Open MPI Development Mailing List Archives


Subject: [OMPI devel] v1.7.4, mpiexec "exit 1" and no other warning - behaviour changed from previous versions
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2014-02-11 04:22:45

Dear Open MPI developer,

we see peculiar behaviour in the new 1.7.4 version of Open MPI, a change from
previous versions:
- when calling "mpiexec", it returns "1" and exits silently.

The behaviour is reproducible, though not easily.

We have multiple InfiniBand islands in our cluster. All nodes are reachable
from each other without a password in one way or another: some via IPoIB, while
for some routes you also have to use ethernet cards and IB/TCP gateways.

One island (b) is configured to use the IB card as the main TCP interface. In
this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*).

Another island (h) is configured in the conventional way: IB cards are also
present here and may be used for IPoIB within the island, but the "main
interface" used for DNS and hostname binding is eth0.

When calling 'mpiexec' from (b) to start a process on (h), and the Open MPI
version is 1.7.4, and OMPI_MCA_oob_tcp_if_include is set to "ib0", mpiexec just
exits with return value "1" and no error or warning.

When OMPI_MCA_oob_tcp_if_include is unset, everything works fine.
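A workaround sketch we could imagine (our own idea, not an official fix;
interface name and variable as above): only export the restriction on nodes
that actually have an ib0 interface, so hosts without one keep Open MPI's
default interface selection:

```shell
# Sketch: set oob_tcp_if_include only where an ib0 interface really exists
# (checked via /proc/net/dev); otherwise leave the variable unset so Open MPI
# falls back to its default out-of-band interface selection.
if grep -q '^ *ib0:' /proc/net/dev; then
    export OMPI_MCA_oob_tcp_if_include=ib0
else
    unset OMPI_MCA_oob_tcp_if_include
fi
echo "OMPI_MCA_oob_tcp_if_include=${OMPI_MCA_oob_tcp_if_include:-<unset>}"
```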

All previous versions of Open MPI (1.6.x, 1.7.3) did not show this behaviour;
it is specific to v1.7.4. See the log below.

You may ask why the hell we start MPI processes on another IB island at all:
our front-end nodes are in island (b), but we sometimes also need to start
something on island (h), which worked perfectly until 1.7.4.

(*) This is another long Spaghetti-Western story. In short, we set
OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is
configured as the main network interface, in order to stop Open MPI from trying
to connect via (possibly unconfigured) ethernet cards, which sometimes led to
endless waiting.
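The same per-island setting could also live in Open MPI's site-wide MCA
parameter file instead of the environment, so it is tied to the installation
rather than the shell (a sketch; the file path is the standard location under
the install prefix):

```
# $prefix/etc/openmpi-mca-params.conf on island (b) nodes (sketch):
oob_tcp_if_include = ib0
```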

pk224850_at_cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3

Unloading openmpi 1.7.3
                         [ OK ]
Loading openmpi 1.7.3 for intel compiler
                         [ OK ]
pk224850_at_cluster:~[524]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
pk224850_at_cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4

Unloading openmpi 1.7.3
                         [ OK ]
Loading openmpi 1.7.4 for intel compiler
                         [ OK ]
pk224850_at_cluster:~[526]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?

During some experiments with environment variables and v1.7.4, I got the
messages below.

Sorry! You were supposed to get help about:
But I couldn't open the help file:
     /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory. Sorry!
[linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314

$MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -H linuxscc004 -np 1 hostname

*from a node with no 'ib0' card*, and without InfiniBand at all. Yes, this is a
bad idea, but 1.7.3 gave a much more understandable "you are doing something
wrong" message:
None of the networks specified to be included for out-of-band communications
could be found:

   Value given: ib0

Please revise the specification and try again.

I have no idea why the file share/openmpi/help-oob-tcp.txt was not installed in
1.7.4; we compiled this version in much the same way as previous versions.

Paul Kapinos

Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915