Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] v1.7.4, mpiexec "exit 1" and no other warning - behaviour changed to previous versions
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-13 10:34:09


Okay, this exposed the problem. The issue is that "ib0" on the two machines is defined on two completely different IP subnets:

linuxbmc0008: 134.61.202.7
linuxscc004: 192.168.222.4

The OOB doesn't think those two addresses can reach each other directly because their IP/subnet-mask combinations don't match - we obviously need a better test, or maybe we should just default to trying the connection and fail if we can't make it. Let me ponder that one a bit.
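
To make the mismatch concrete, here is a minimal shell sketch of a subnet comparison of the kind described above. It is not the OOB code: the function names ip_to_int and same_subnet are hypothetical, and the /24 prefix length is an assumption, since the real netmasks are only in the attached ifconfig logs.

```shell
#!/bin/sh
# Illustrative only: not taken from the Open MPI sources.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if two addresses share a network under the given prefix length.
same_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    net_a=$(( $(ip_to_int "$1") & mask ))
    net_b=$(( $(ip_to_int "$2") & mask ))
    [ "$net_a" -eq "$net_b" ]
}

# ib0 addresses from this thread; the prefix length 24 is assumed.
if same_subnet 134.61.202.7 192.168.222.4 24; then
    echo "same subnet: treated as directly reachable"
else
    echo "different subnets: not treated as directly reachable"
fi
```

With these two addresses the check fails for any realistic prefix length, which matches the behaviour seen here: a pure address/mask comparison cannot tell whether a route between the two networks exists.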

Thanks!

On Feb 13, 2014, at 3:05 AM, Paul Kapinos <kapinos_at_[hidden]> wrote:

> Attached the output from openmpi/1.7.5a1r30708
>
> $ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-175a1r29587.txt
>
> Well, about 5 lines were added.
> (The ib0 on linuxscc004 is not reachable from linuxbmc0008 - did this lead to the TCP shutdown? Cf. lines 36-37)
>
>
> On 02/13/14 01:28, Ralph Castain wrote:
>> Could you please give the nightly 1.7.5 tarball a try using the same cmd line options and send me the output? I see the problem, but am trying to understand how it happens. I've added a bunch of diagnostic statements that should help me track it down.
>>
>> Thanks
>> Ralph
>>
>> On Feb 12, 2014, at 1:26 AM, Paul Kapinos <kapinos_at_[hidden]> wrote:
>>
>>> As said, the change in behaviour is new in 1.7.4 - all previous versions worked. Moreover, setting "-mca oob_tcp_if_include ib0" is a workaround in older versions of Open MPI for a 60-second timeout when starting the same command (which still succeeds), or for infinite waiting in some cases.
>>>
>>>
>>>
>>> Attached are logs of the commands:
>>> $ export | grep OMPI | tee export_OMPI-linuxbmc0008.txt
>>>
>>> $ $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -mca oob_base_verbose 100 -H linuxscc004 -np 1 hostname 2>&1 | tee oob_base_verbose-linuxbmc0008-173.txt
>>>
>>> (and -174; the suffixes correspond to versions 1.7.3 and 1.7.4)
>>>
>>>
>>> $ ifconfig 2>&1 | tee ifconfig-linuxbmc0008.txt
>>>
>>> (and -linuxscc004 for the two nodes; linuxscc004 is in (h) fabric and 'mpiexec' was called from node linuxbmc0008 which is in the (b) fabric where the 'ib0' is configured to be the main interface)
>>>
>>> and the OMPI environment on linuxbmc0008. Maybe you can see something from this.
>>>
>>> Best
>>> Paul
>>>
>>>
>>> On 02/11/14 20:29, Ralph Castain wrote:
>>>> I've added better error messages in the trunk, scheduled to move over to 1.7.5. I don't see anything in the code that would explain why we don't pick up and use ib0 if it is present and specified in if_include - we should be doing it.
>>>>
>>>> For now, can you run this with "-mca oob_base_verbose 100" on your cmd line and send me the output? Might help debug the behavior.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Feb 11, 2014, at 1:22 AM, Paul Kapinos <kapinos_at_[hidden]> wrote:
>>>>
>>>>> Dear Open MPI developer,
>>>>>
>>>>> I.
>>>>> We see peculiar behaviour in the new 1.7.4 version of Open MPI, which is a change from previous versions:
>>>>> - when calling "mpiexec", it returns "1" and exits silently.
>>>>>
>>>>> The behaviour is reproducible, although not that easily.
>>>>>
>>>>> We have multiple InfiniBand islands in our cluster. All nodes can reach each other without a password in one way or another: some via IPoIB, while some routes also require Ethernet cards and IB/TCP gateways.
>>>>>
>>>>> One island (b) is configured to use the IB card as the main TCP interface. In this island, the variable OMPI_MCA_oob_tcp_if_include is set to "ib0" (*)
>>>>>
>>>>> Another island (h) is configured in the conventional way: IB cards are also present and may be used for IPoIB within the island, but the "main interface" used for DNS and hostname binding is eth0.
>>>>>
>>>>> When calling 'mpiexec' from (b) to start a process on (h), with Open MPI version 1.7.4 and OMPI_MCA_oob_tcp_if_include set to "ib0", mpiexec just exits with return value "1" and no error or warning.
>>>>>
>>>>> When OMPI_MCA_oob_tcp_if_include is unset, it works fine.
>>>>>
>>>>> No previous version of Open MPI (1.6.x, 1.7.3) showed this behaviour, so it is specific to v1.7.4. See the log below.
>>>>>
>>>>> You may ask why on earth we start MPI processes on another IB island. Because our front-end nodes are in island (b), but we sometimes need to start something on island (h) as well, which worked perfectly until 1.7.4.
>>>>>
>>>>>
>>>>> (*) This is another Spaghetti Western long story. In short, we set OMPI_MCA_oob_tcp_if_include to 'ib0' in the subcluster where the IB card is configured to be the main network interface, in order to stop Open MPI trying to connect via (possibly unconfigured) ethernet cards - which sometimes led to endless waiting.
>>>>> Cf. http://www.open-mpi.org/community/lists/users/2011/11/17824.php
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> pk224850_at_cluster:~[523]$ module switch $_LAST_MPI openmpi/1.7.3
>>>>> Unloading openmpi 1.7.3 [ OK ]
>>>>> Loading openmpi 1.7.3 for intel compiler [ OK ]
>>>>> pk224850_at_cluster:~[524]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>>>> linuxscc004.rz.RWTH-Aachen.DE
>>>>> 0
>>>>> pk224850_at_cluster:~[525]$ module switch $_LAST_MPI openmpi/1.7.4
>>>>> Unloading openmpi 1.7.3 [ OK ]
>>>>> Loading openmpi 1.7.4 for intel compiler [ OK ]
>>>>> pk224850_at_cluster:~[526]$ $MPI_BINDIR/mpiexec -H linuxscc004 -np 1 hostname ; echo $?
>>>>> 1
>>>>> pk224850_at_cluster:~[527]$
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>> II.
>>>>> During some experiments with environment variables and v1.7.4, I got the messages below.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Sorry! You were supposed to get help about:
>>>>> no-included-found
>>>>> But I couldn't open the help file:
>>>>> /opt/MPI/openmpi-1.7.4/linux/intel/share/openmpi/help-oob-tcp.txt: No such file or directory. Sorry!
>>>>> --------------------------------------------------------------------------
>>>>> [linuxc2.rz.RWTH-Aachen.DE:13942] [[63331,0],0] ORTE_ERROR_LOG: Not available in file ess_hnp_module.c at line 314
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Reproducing:
>>>>> $MPI_BINDIR/mpiexec -mca oob_tcp_if_include ib0 -H linuxscc004 -np 1 hostname
>>>>>
>>>>> *from a node with no 'ib0' card*, i.e. without InfiniBand. Yes, this is a bad idea, and 1.7.3 reported more understandably that you are doing the wrong thing:
>>>>> --------------------------------------------------------------------------
>>>>> None of the networks specified to be included for out-of-band communications
>>>>> could be found:
>>>>>
>>>>> Value given: ib0
>>>>>
>>>>> Please revise the specification and try again.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>> I have no idea why the file share/openmpi/help-oob-tcp.txt was not installed in 1.7.4, as we compile this version in pretty much the same way as previous versions.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Best,
>>>>> Paul Kapinos
>>>>>
>>>>> --
>>>>> Dipl.-Inform. Paul Kapinos - High Performance Computing,
>>>>> RWTH Aachen University, IT Center
>>>>> Seffenter Weg 23, D 52074 Aachen (Germany)
>>>>> Tel: +49 241/80-24915
>>>>>
>>>>
>>>
>>>
>>> <oob_base_verbose-linuxbmc0008-165.txt><oob_base_verbose-linuxbmc0008-173.txt><oob_base_verbose-linuxbmc0008-174.txt><export_OMPI-linuxbmc0008.txt><ifconfig-linuxbmc0008.txt><ifconfig-linuxscc004.txt>
>>
>
>
> <oob_base_verbose-linuxbmc0008-175a1r29587.txt>
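
Part II of the original report quoted above comes down to naming an interface in oob_tcp_if_include that does not exist on the launching node. A launch wrapper could check for that before calling mpiexec. The sketch below assumes a Linux node, where interfaces appear as entries under /sys/class/net; the function name iface_exists and its optional second argument (an alternative directory, handy for testing) are hypothetical and not part of Open MPI.

```shell
#!/bin/sh
# Illustrative pre-flight check; assumes Linux sysfs.

# Succeed if a network interface named $1 exists. $2 optionally overrides
# the directory to look in (default: /sys/class/net).
iface_exists() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

# OMPI_MCA_oob_tcp_if_include is the variable discussed in this thread.
# Only a plain interface name such as ib0 is handled here.
ifname=${OMPI_MCA_oob_tcp_if_include:-}
if [ -n "$ifname" ] && ! iface_exists "$ifname"; then
    echo "warning: interface '$ifname' not found on this node" >&2
fi
```

Such a check only covers the local node; it says nothing about whether the same interface on the remote node is reachable, which is the separate problem diagnosed at the top of this message.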