Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Gus Correa (gus_at_[hidden])
Date: 2011-07-08 16:26:35


Steve Kargl wrote:
> On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
>> The easiest way to fix this is likely to use the btl_tcp_if_include
>> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
>> which interfaces to use:
>>
>> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>>
>
> Perhaps, I'm again misreading the output, but it appears that
> 1.4.4rc2 does not even see the 2nd nic.
>
> hpc:kargl[317] ifconfig bge0
> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
> ether 00:e0:81:40:48:92
> inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
> inet6 fe80::2e0:81ff:fe40:4892%bge0 prefixlen 64 scopeid 0x3
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> media: Ethernet autoselect (1000baseT <full-duplex>)
> status: active
> hpc:kargl[318] ifconfig bge1
> bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
> ether 00:e0:81:40:48:93
> inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
> inet6 fe80::2e0:81ff:fe40:4893%bge1 prefixlen 64 scopeid 0x4
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> media: Ethernet autoselect (1000baseT <full-duplex>)
> status: active
>
> kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
> --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>
> hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 10 --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
> [hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl components
> [hpc.apl.washington.edu:12295] mca: base: components_open: opening btl components
> [hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component self
> [hpc.apl.washington.edu:12295] mca: base: components_open: component self has no register function
> [hpc.apl.washington.edu:12295] mca: base: components_open: component self open function successful
> [hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component sm
> [hpc.apl.washington.edu:12295] mca: base: components_open: component sm has no register function
> [hpc.apl.washington.edu:12295] mca: base: components_open: component sm open function successful
> [hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component tcp
> [hpc.apl.washington.edu:12295] mca: base: components_open: component tcp has no register function
> [hpc.apl.washington.edu:12295] mca: base: components_open: component tcp open function successful
> [hpc.apl.washington.edu:12295] select: initializing btl component self
> [hpc.apl.washington.edu:12295] select: init of component self returned success
> [hpc.apl.washington.edu:12295] select: initializing btl component sm
> [hpc.apl.washington.edu:12295] select: init of component sm returned success
> [hpc.apl.washington.edu:12295] select: initializing btl component tcp
> [hpc.apl.washington.edu:12295] select: init of component tcp returned success
> [node11.cimu.org:21878] mca: base: components_open: Looking for btl components
> [node11.cimu.org:21878] mca: base: components_open: opening btl components
> [node11.cimu.org:21878] mca: base: components_open: found loaded component self
> [node11.cimu.org:21878] mca: base: components_open: component self has no register function
> [node11.cimu.org:21878] mca: base: components_open: component self open function successful
> [node11.cimu.org:21878] mca: base: components_open: found loaded component sm
> [node11.cimu.org:21878] mca: base: components_open: component sm has no register function
> [node11.cimu.org:21878] mca: base: components_open: component sm open function successful
> [node11.cimu.org:21878] mca: base: components_open: found loaded component tcp
> [node11.cimu.org:21878] mca: base: components_open: component tcp has no register function
> [node11.cimu.org:21878] mca: base: components_open: component tcp open function successful
> [node11.cimu.org:21878] select: initializing btl component self
> [node11.cimu.org:21878] select: init of component self returned success
> [node11.cimu.org:21878] select: initializing btl component sm
> [node11.cimu.org:21878] select: init of component sm returned success
> [node11.cimu.org:21878] select: initializing btl component tcp
> [node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "bge1"
> [node11.cimu.org:21878] select: init of component tcp returned success
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
Hi Steve

The error says that bge1 is invalid on node11, not on node10/hpc,
where you ran ifconfig.

Could the interface names and their matching subnets/IPs
vary from node to node?
Perhaps bge1 is on the 192.168.0.0 subnet only on node10, while on
the other nodes it is bge0 that connects to the 192.168.0.0 subnet
(e.g. bge0 associated with 192.168.0.11 on node11, instead of bge1).

If you include only bge1 with btl_tcp_if_include,
then every node must be able to reach the others
via an interface called bge1.
Is this really the case?
You may want to run ifconfig on all nodes to check.
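
A rough sketch of that check, in case it saves some typing: the
function below reads ifconfig output on standard input and prints the
name of every interface whose IPv4 address starts with a given prefix.
The function name iface_on_subnet, the host names in the commented
loop, and passwordless ssh to the nodes are my assumptions, not
something taken from your setup. It expects the "name: flags=..."
header style shown in your ifconfig output.

```shell
#!/bin/sh
# Print the name of each interface whose IPv4 address begins with the
# prefix given as $1, reading ifconfig output on standard input.
# iface_on_subnet is a hypothetical helper name, not an Open MPI tool.
iface_on_subnet() {
    awk -v prefix="$1" '
        # A header line such as "bge1: flags=8843<...>" starts a new interface.
        $2 ~ /^flags=/ { name = $1; sub(/:$/, "", name) }
        # An "inet a.b.c.d ..." line holds an IPv4 address of that interface;
        # "inet6" lines are skipped because $1 must equal "inet" exactly.
        $1 == "inet" && index($2, prefix) == 1 { print name }
    '
}

# Possible use (host names are placeholders; assumes passwordless ssh):
#   for host in node10 node11; do
#       printf '%s: ' "$host"
#       ssh "$host" ifconfig | iface_on_subnet 192.168.0.
#   done
```

If the loop prints bge1 for one node and bge0 for another, the names
do not line up across nodes, and restricting the tcp btl to bge1 alone
cannot work on every node.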

Alternatively, you could exclude node10 from your host file
and try to run the job on the remaining nodes
(and maybe not restrict the interface names with any btl switch).

I hope this helps,
Gus Correa

PS - Your next email, saying that it works with
"--mca btl_tcp_if_include bge1,bge0",
suggests that node11 and higher use bge0 for 192.168.0.0,
instead of bge1.