Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question about '--mca btl tcp,self'
From: Gus Correa (gus_at_[hidden])
Date: 2014-03-17 12:37:07


On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote:
> To add on to what Ralph said:
>
> 1. There are two different message passing paths in OMPI:
> - "OOB" (out of band): used for control messages
> - "BTL" (byte transfer layer): used for MPI traffic
> (there are actually others, but these seem to be the relevant 2 for your setup)
>
> 2. If you don't specify which OOB interfaces
> to use, OMPI will (basically) just pick one.
> It doesn't really matter which one it uses;
> the OOB channel doesn't use too much bandwidth,
> and is mostly just during startup and shutdown.
>
> The one exception to this is stdout/stderr routing.
> If your MPI app writes to stdout/stderr, this also uses the OOB path.
> So if you output a LOT to stdout, then the OOB interface choice might
> matter.

Hi All

Not trying to hijack Jianyu's very interesting and informative questions
and thread, I have two questions and one note about it.
I promise to shut up after this.

Is the interface that OOB picks and uses
somehow related to how the host/node names listed
in a "hostfile"
(or in the mpiexec -host option,
or in the Torque/SGE/Slurm node file)
are resolved into IP addresses (via /etc/hosts, DNS, or another mechanism)?

In other words, does OOB pick the interface associated with the IP address
that the specific node name resolves to, or does OOB have a will of its own
and pick whatever interface it wants?

At some early point during startup, I suppose mpiexec
needs to touch base with each node for the first time,
and I would guess the node's IP address
(and the corresponding interface) plays a role then.
Does OOB piggyback on that same interface to do its job?
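
As a side note, the address a node name resolves to can be checked directly. The sketch below assumes a Linux system with getent available and uses "localhost" as a stand-in for a real node name. It says nothing about what OOB does with the result; it only shows which address the name maps to, so it can be compared with the addresses of the node's interfaces.

```shell
#!/bin/sh
# "localhost" is a stand-in; substitute a node name from your hostfile.
NODE="localhost"

# getent consults the sources the system resolver is configured to use
# (/etc/hosts, DNS, ...), as listed in /etc/nsswitch.conf.
ADDR=$(getent hosts "$NODE" | awk '{print $1; exit}')

# Report the first address found, or say so if the name did not resolve.
RESULT="$NODE resolves to ${ADDR:-no address}"
echo "$RESULT"
```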

>
> 3. If you don't specify which MPI interfaces to use, OMPI will basically find the
> "best" set of interfaces and use those. IP interfaces are always rated
> less than OS-bypass interfaces (e.g., verbs/IB).

In a node outfitted with more than one InfiniBand interface,
can one choose which one OMPI is going to use (say, if one wants to
reserve the other IB interface for I/O)?

In other words, is there a verbs/RDMA equivalent of

--mca btl_tcp_if_include

and to

--mca oob_tcp_if_include ?

[Perhaps something like --mca btl_openib_if_include ...?]

Forgive me if this question doesn't make sense;
maybe under the hood verbs/RDMA already has a greedy policy of using
everything available, but I don't know anything about it.
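
To make the bracketed guess above concrete, here is a sketch of how such a command line would be put together. It assumes the openib BTL accepts a device name in btl_openib_if_include in the same way the TCP BTL accepts an interface name, which is exactly what I am asking about. The device name "mlx4_0", the process count, and "./my_app" are hypothetical placeholders. The script only prints the command; it does not launch anything.

```shell
#!/bin/sh
# ASSUMPTION: btl_openib_if_include behaves like btl_tcp_if_include,
# but takes an InfiniBand device name instead of an IP interface name.
# "mlx4_0" is a hypothetical device name (ibv_devinfo lists the real ones).
IB_DEVICE="mlx4_0"

# Placeholders, not taken from this thread.
NP=4
APP="./my_app"

# openib between nodes, sm within a node, self for a process sending to itself.
CMD="mpirun --mca btl openib,sm,self --mca btl_openib_if_include $IB_DEVICE -np $NP $APP"

# Print only, so the command can be inspected before use.
echo "$CMD"
```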

>
> Or, as you noticed, you can give a comma-delimited list of BTLs to use.
> OMPI will then use -- at most -- exactly those BTLs, but definitely no
> others.
> Each BTL typically has an additional parameter or parameters that can be
> used to specify which interfaces to use for the network interface type that
> that BTL uses.
> For example, btl_tcp_if_include tells the TCP BTL which interface(s) to use.
>
> Also, note that you seem to have missed a BTL: sm (shared memory).
sm is the preferred BTL to use for same-server communication.

This may be because several FAQs skip the sm BTL, even when it would
be an appropriate/recommended choice to include in the BTL list.
For instance:

http://www.open-mpi.org/faq/?category=all#selecting-components
http://www.open-mpi.org/faq/?category=all#tcp-selection

The command line examples with an ellipsis "..." don't actually
exclude the use of "sm", but IMHO are too vague and somewhat misleading.

I think this issue was reported/discussed before on the list,
but somehow the FAQ was not fixed.
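
For illustration, a command line that spells out sm instead of hiding it behind an ellipsis could look like the sketch below. This is my own example, not text from the FAQ; the process count and "./my_app" are placeholders, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Placeholders (not from the FAQ or this thread).
NP=4
APP="./my_app"

# Explicit BTL list:
#   tcp  - traffic between nodes
#   sm   - shared memory between processes on the same node
#   self - a process sending to itself
BTL_LIST="tcp,sm,self"

CMD="mpirun --mca btl $BTL_LIST -np $NP $APP"

# Print only; nothing is launched.
echo "$CMD"
```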

Thank you,
Gus Correa

> It (sm) is much faster than both the TCP loopback device
> (which OMPI excludes by default, BTW, which is probably
> why you got reachability errors when you specified
> "--mca btl tcp,self") and the verbs (i.e., "openib")
> BTL for same-server communication.
>
> 4. If you don't specify anything, OMPI usually picks the best thing for you.
> In your case, it'll probably be equivalent to:
>
> mpirun --mca btl openib,sm,self ...
>
> And the control messages will flow across one of your IP interfaces.
>
> 5. If you want to be specific about which one it uses,
> you can specify oob_tcp_if_include. For example:
>
> mpirun --mca oob_tcp_if_include eth0 ...
>
> Make sense?
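
The same two settings can also be given through the environment instead of on the mpirun command line. This is a minimal sketch, assuming the usual convention that an MCA parameter named <param> is read from the environment variable OMPI_MCA_<param>; the values are the ones from the examples above, and no MPI job is launched here.

```shell
#!/bin/sh
# Open MPI reads MCA parameters from variables named OMPI_MCA_<param>.
export OMPI_MCA_btl="openib,sm,self"        # BTL list from point 4
export OMPI_MCA_oob_tcp_if_include="eth0"   # OOB interface from point 5

# Show what an mpirun started from this shell would inherit.
env | grep '^OMPI_MCA_' | sort
```

Settings given this way apply to every mpirun started from that shell, which is convenient in batch scripts but easy to forget about later.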
>
>
>
> On Mar 15, 2014, at 1:18 AM, Jianyu Liu <jerry_leo_at_[hidden]> wrote:
>
>>> On Mar 14, 2014, at 10:16:34 AM,Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>
>>>> On Mar 14, 2014, at 10:11 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>
>>>>> 1. If '--mca btl tcp,self' is specified, which interface will the application run on: the GigE adapter, or the OpenFabrics interface in IP over IB mode (just like a high performance GigE adapter)?
>>>>
>>>> Both - ip over ib looks just like an Ethernet adaptor
>>>
>>>
>>> To be clear: the TCP BTL will use all TCP interfaces (regardless of underlying physical transport). Your GigE adapter and your IP adapter both present IP interfaces to the OS, and both support TCP. So the TCP BTL will use them, because it just sees the TCP/IP interfaces.
>>
>> Thanks for your kind input.
>>
>> Please see if I have understood correctly.
>>
>> Assume there are two networks:
>> Gigabit Ethernet network
>>
>> eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0
>>
>> InfiniBand network
>>
>> ib0 : 172.20.[1-22].[1-4] / 255.255.0.0
>>
>>
>> 1. If '--mca btl tcp,self' is specified
>>
>> The control information (such as setup and teardown) is routed over Gigabit Ethernet in TCP/IP mode
>> The MPI messages are routed over the InfiniBand network in IP over IB mode
>> On the same machine, the TCP loopback device will be used for passing control and MPI messages
>>
>> 2. If '--mca btl tcp,self --mca btl_tcp_if_include ib0' is specified
>>
>> Both the control information (such as setup and teardown) and the MPI messages are routed over the InfiniBand network in IP over IB mode
>> On the same machine, the TCP loopback device will be used for passing control and MPI messages
>>
>>
>> 3. If '--mca btl openib,self' is specified
>>
>> The control information (such as setup and teardown) is routed over the InfiniBand network in IP over IB mode
>> The MPI messages are routed over the InfiniBand network in RDMA mode
>> On the same machine, the TCP loopback device will be used for passing control and MPI messages
>>
>>
>> 4. If no 'mca btl' parameters are specified
>>
>> The control information (such as setup and teardown) is routed over Gigabit Ethernet in TCP/IP mode
>> The MPI messages are routed over the InfiniBand network in RDMA mode
>> On the same machine, the shared memory (sm) BTL will be used for passing control and MPI messages
>>
>>
>> I appreciate your kind input
>>
>> Jianyu