Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-07-08 19:03:13


Sorry -- I got distracted all afternoon...

In addition to what Ralph said (I'm not sure whether the CIDR notation support made it over to the v1.5 branch, but it is available in the nightly SVN trunk tarballs: http://www.open-mpi.org/nightly/trunk/), here are a few points from other mails in this thread...

1. Gus is correct that OMPI is complaining that bge1 doesn't exist on all nodes. The MCA parameters that you pass on the command line get shipped to *all* MPI processes, and therefore generally need to work on all of them. If you have per-host MCA parameter values, you can set them a few different ways:

- have a per-host MCA param file, usually in $prefix/etc/openmpi-mca-params.conf
- have your shell startup files intelligently determine which host you're on and set the corresponding MCA environment variable as appropriate (e.g., on the head node, set the env variable OMPI_MCA_btl_tcp_if_include to bge1, and set it to bge0 on the others)

Those are a little klunky, but having a heterogeneous setup like this is not common, so we haven't really optimized the ability to set different MCA params on different servers.
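
As a rough sketch of the shell startup file option, something like the following could go in a Bourne-style startup file on every node. The host name patterns (hpc*, node10*) and the helper name pick_iface are assumptions made up for illustration, not anything Open MPI provides; adjust the patterns to whatever `uname -n` reports on your head node. Only the OMPI_MCA_btl_tcp_if_include variable and the bge0/bge1 interface names come from this thread.

```shell
# Hypothetical helper: print the interface the TCP BTL should use on a given
# host. The host name patterns are assumptions; adjust them for your cluster.
pick_iface() {
    case "$1" in
        hpc*|node10*) echo bge1 ;;   # head node: bge1 is on 192.168.0.x
        *)            echo bge0 ;;   # other nodes: bge0 is on 192.168.0.x
    esac
}

# Set the MCA parameter through its environment variable on this host.
OMPI_MCA_btl_tcp_if_include=$(pick_iface "$(uname -n)")
export OMPI_MCA_btl_tcp_if_include
```

The per-host file option would instead be a single line such as `btl_tcp_if_include = bge1` in the head node's $prefix/etc/openmpi-mca-params.conf (and bge0 in the file on the other nodes). Either way, the idea is to stop passing --mca btl_tcp_if_include on the mpiexec command line, since command-line values get shipped to all processes.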

2. I am curious to figure out why the automatic reachability computation isn't working for you. Unfortunately, the code to compute the reachability is pretty gnarly. :-\ The code to find the IP interfaces on your machines is in opal/util/if.c. That *should* be working -- there's *BSD-specific code in there that has been verified by others in the past... but who knows? Perhaps it has bit-rotted...? The code to take these IP interfaces and figure out if a given peer is reachable is in ompi/mca/btl/tcp/btl_tcp_proc.c:mca_btl_tcp_proc_insert(). This requires a little explanation...

- There is one TCP BTL "component". Think of this as the plugin that is dlopen'd into the process itself. It contains some high-level information about the plugin itself (e.g., the version number, etc.).

- There is one TCP BTL "module" per IP interface that is used for MPI communications. So your head node will have 2 TCP BTL modules and the others will only have one TCP BTL module. A module is a struct with a bunch of function pointers and some meta data (e.g., which IP interface it "owns", etc.).

- During the BTL module's initialization, btl_tcp.c:mca_btl_tcp_add_procs() is called to notify the module of all of its peers (an ompi_proc_t instance is used to describe a peer process -- note: a *process*, not any particular communications method or IP address of that process). mca_btl_tcp_add_procs() takes the array of ompi_proc_t instances (that correspond to all the MPI processes in MPI_COMM_WORLD) and tries to figure out if this particular TCP BTL module can "reach" each peer, per the algorithm described in the FAQ that I cited earlier.

- mca_btl_tcp_add_procs() calls mca_btl_tcp_proc_insert() to do the reachability computation. If _insert() succeeds, then _add_procs() assumes that this module can reach that process and proceeds accordingly. If _insert() fails, then _add_procs() assumes that this module cannot reach that peer and proceeds accordingly.

- mca_btl_tcp_proc_insert() has previously learned about all the IP addresses of all the peer MPI processes via a different mechanism called the modex (which I won't go into here). It checks the one peer process in question, looks up that peer's IP addresses (aka "endpoints", from that peer's TCP BTL modules), and tries to find the best quality match that it can. To do so, it makes a 2D graph of weights of how "good" the connection is to each of the peer process' endpoints, then finds the best connection and uses that one.

- We unfortunately do not have good debugging output in _proc_insert(), so you might need to step through this with a debugger. :-( I have a long-languished branch that adds lots of debugging output in this reachability computation area, but I have never finished it (it has some kind of bug in it that prevents it from working, which is why I haven't merged it into the mainline).

This was a long explanation -- I hope it helps... Is there any chance you could dig into this to see what's going on? The short version is that all this code *should* automatically figure out that the 10.x interface should effectively end up getting ignored because it can't be used to communicate with any of its TCP BTL module peers in the other processes on the other servers.
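
To make the "10.x interface gets ignored" point concrete, here is a small shell illustration using the addresses and netmask from the ifconfig output quoted below, with 192.168.0.11 standing in for a peer on node11. It is not the Open MPI code, and the real logic in mca_btl_tcp_proc_insert() weighs more than this; the sketch only assumes that a same-subnet check is the part that matters in this case. The function names ip_to_int, same_subnet, and report are made up for this example.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if two addresses fall in the same subnet under the given netmask.
# Arguments: local address, peer address, netmask (hex, as ifconfig prints it).
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( $3 ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

# Print whether a local interface shares a subnet with the peer address.
# Arguments: interface name, local address, peer address, netmask.
report() {
    if same_subnet "$2" "$3" "$4"; then
        echo "$1 can reach $3"
    else
        echo "$1 cannot reach $3"
    fi
}

# Head node interfaces versus a peer endpoint on node11.
report bge0 10.208.78.111 192.168.0.11 0xffffff00   # bge0 cannot reach 192.168.0.11
report bge1 192.168.0.10  192.168.0.11 0xffffff00   # bge1 can reach 192.168.0.11
```

If the interface discovery on your machine hands back the wrong address or netmask for either interface, a check along these lines would come out differently, which is one reason opal/util/if.c is worth a look.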

We unfortunately don't have access to any BSD machines to test this on ourselves. It works on other OSes, so I'm curious as to why it doesn't seem to work for you. :-(

On Jul 8, 2011, at 5:37 PM, Ralph Castain wrote:

> We've been moving to provide support for including values as CIDR notation instead of names - e.g., 192.168.0/16 instead of bge0 or bge1 - but I don't think that has been put into the 1.4 release series. If you need it now, you might try using the developer's trunk - I know it works there.
>
>
> On Jul 8, 2011, at 2:49 PM, Steve Kargl wrote:
>
>> On Fri, Jul 08, 2011 at 04:26:35PM -0400, Gus Correa wrote:
>>> Steve Kargl wrote:
>>>> On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
>>>>> The easiest way to fix this is likely to use the btl_tcp_if_include
>>>>> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
>>>>> which interfaces to use:
>>>>>
>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>>>>>
>>>>
>>>> Perhaps, I'm again misreading the output, but it appears that
>>>> 1.4.4rc2 does not even see the 2nd nic.
>>>>
>>>> hpc:kargl[317] ifconfig bge0
>>>> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>>> ether 00:e0:81:40:48:92
>>>> inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
>>>> hpc:kargl[318] ifconfig bge1
>>>> bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>> options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>>> ether 00:e0:81:40:48:93
>>>> inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
>>>>
>>>> kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
>>>> --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>>>
>>>> hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose
>>>> 10 --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>>> [hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl
>>>> [node11.cimu.org:21878] select: init of component self returned success
>>>> [node11.cimu.org:21878] select: initializing btl component sm
>>>> [node11.cimu.org:21878] select: init of component sm returned success
>>>> [node11.cimu.org:21878] select: initializing btl component tcp
>>>> [node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "bge1"
>>>> [node11.cimu.org:21878] select: init of component tcp returned success
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes. This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other. This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>>
>>> Hi Steve
>>>
>>> It is complaining that bge1 is not valid on node11, not on node10/hpc,
>>> where you ran ifconfig.
>>>
>>> Would the names of the interfaces and the matching subnet/IP
>>> vary from node to node?
>>> (E.g. bge0 be associated to 192.168.0.11 on node11, instead of bge1.)
>>>
>>> Would it be possible that only on node10 bge1 is on the 192.168.0.0
>>> subnet, but on the other nodes it is bge0 that connects
>>> to the 192.168.0.0 subnet perhaps?
>>
>> node10 has bge0 = 10.208.x.y and bge1 = 192.168.0.10.
>> node11 through node21 use bge0 = 192.168.0.N where N = 11, ..., 21.
>>
>>> If you're including only bge1 on your mca btl switch,
>>> supposedly all nodes are able to reach
>>> each other via an interface called bge1.
>>> Is this really the case?
>>> You may want to run ifconfig on all nodes to check.
>>>
>>> Alternatively, you could exclude node10 from your host file
>>> and try to run the job on the remaining nodes
>>> (and maybe not restrict the interface names with any btl switch).
>>
>> Completely excluding node10 would appear to work. Of course,
>> this then loses the 4 cpus and 16 GB of memory that are
>> in node.
>>
>> The question to me is why does 1.4.2 work without a
>> problem, and 1.4.3 and 1.4.4 have problems with a
>> node with 2 NICs.
>>
>> I suppose a follow-on question is: Is there some
>> way to get 1.4.4 to exclusively use bge1 on node10
>> while using bge0 on the other nodes?
>>
>> --
>> Steve
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/