Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-07-08 17:37:41


We've been moving toward supporting these values in CIDR notation instead of interface names - e.g., 192.168.0.0/16 instead of bge0 or bge1 - but I don't think that support has made it into the 1.4 release series. If you need it now, you might try the developer's trunk - I know it works there.
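
For example (syntax as it currently stands on the trunk; it may change before it lands in a release), something along these lines should select the private subnet on every node regardless of what the interface happens to be called locally:

  mpiexec --mca btl_tcp_if_include 192.168.0.0/24 -machinefile mf1 ./z

That would sidestep the bge1-on-node10 versus bge0-on-the-other-nodes naming difference discussed below.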

On Jul 8, 2011, at 2:49 PM, Steve Kargl wrote:

> On Fri, Jul 08, 2011 at 04:26:35PM -0400, Gus Correa wrote:
>> Steve Kargl wrote:
>>> On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
>>>> The easiest way to fix this is likely to use the btl_tcp_if_include
>>>> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
>>>> which interfaces to use:
>>>>
>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>>>>
>>>
>>> Perhaps I'm again misreading the output, but it appears that
>>> 1.4.4rc2 does not even see the second NIC.
>>>
>>> hpc:kargl[317] ifconfig bge0
>>> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>> ether 00:e0:81:40:48:92
>>> inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
>>> hpc:kargl[318] ifconfig bge1
>>> bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>> ether 00:e0:81:40:48:93
>>> inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
>>>
>>> kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
>>> --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>>
>>> hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose
>>> 10 --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>> [hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl
>>> [node11.cimu.org:21878] select: init of component self returned success
>>> [node11.cimu.org:21878] select: initializing btl component sm
>>> [node11.cimu.org:21878] select: init of component sm returned success
>>> [node11.cimu.org:21878] select: initializing btl component tcp
>>> [node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "bge1"
>>> [node11.cimu.org:21878] select: init of component tcp returned success
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>> Hi Steve
>>
>> It is complaining that bge1 is not valid on node11, not on node10/hpc,
>> where you ran ifconfig.
>>
>> Could the names of the interfaces and the matching subnet/IP
>> vary from node to node?
>> (E.g., bge0 might be associated with 192.168.0.11 on node11, instead of bge1.)
>>
>> Could it be that bge1 is on the 192.168.0.0 subnet
>> only on node10, while on the other nodes it is bge0
>> that connects to that subnet?
>
> node10 has bge0 = 10.208.x.y and bge1 = 192.168.0.10.
> node11 through node21 use bge0 = 192.168.0.N where N = 11, ..., 21.
>
>> If you include only bge1 with your MCA btl switch,
>> the assumption is that all nodes can reach
>> each other via an interface called bge1.
>> Is that really the case?
>> You may want to run ifconfig on all nodes to check.
>>
>> Alternatively, you could exclude node10 from your host file
>> and try to run the job on the remaining nodes
>> (and maybe not restrict the interface names with any btl switch).
>
> Completely excluding node10 does appear to work. Of course,
> this then loses the 4 CPUs and 16 GB of memory that are
> in that node.
>
> The question for me is why 1.4.2 works without a
> problem, while 1.4.3 and 1.4.4 have problems with a
> node that has two NICs.
>
> I suppose a follow-on question is: is there some
> way to get 1.4.4 to exclusively use bge1 on node10
> while using bge0 on the other nodes?
>
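One possible workaround for the 1.4 series (a sketch only - I haven't tried it on 1.4.4): MCA parameters can also be set in the openmpi-mca-params.conf file that each node reads from its local installation, so if the installs are node-local (not on a shared filesystem) you could give node10 a different btl_tcp_if_include value than the rest of the cluster:

  # node10: /usr/local/openmpi-1.4.4/etc/openmpi-mca-params.conf
  btl_tcp_if_include = bge1

  # node11 .. node21: same file on each of those nodes
  btl_tcp_if_include = bge0

You would then drop --mca btl_tcp_if_include from the mpiexec command line, since a value given there overrides the per-node files.
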
> --
> Steve
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users