
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2006-12-04 14:46:39

There are two distinct layers of software being discussed here:

- the PML (basically the back-end to MPI_SEND and friends)
- the BTL (byte transfer layer, the back-end bit movers for the ob1
and dr PMLs -- this distinction is important because there is nothing
in the PML design that forces the use of BTL's; indeed, there is at
least one current PML that does not use BTL's as the back-end bit
mover [the cm PML])

The ob1 and dr PMLs know nothing about how the back-end bitmovers
work (BTL components) -- the BTLs are given considerable freedom to
operate within their specific interface contracts.

Generally, ob1/dr queries each BTL component when Open MPI starts
up. Each BTL responds with whether it wants to run or not. If it
does, it gives back one or more modules (think of a module as an
"instance" of a component). Typically, these modules correspond to
multiple NICs / HCAs / network endpoints. For example, if you have 2
ethernet cards, the tcp BTL will create and return 2 modules. ob1 /
dr will treat these as two paths to send data (reachability is
computed as well, of course -- ob1/dr will only send data down BTLs
for which the target peer is reachable). In general, ob1/dr will
round-robin across all available BTL modules when sending large
messages (as Gleb has described). See
euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/
dr protocols.
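
The round-robin scheduling described above can be sketched in a few lines of C. This is purely illustrative -- the names and structures are hypothetical and not the actual Open MPI API -- but it shows the core idea: the PML keeps a rotating index over the modules for which the peer is reachable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not the real Open MPI code): a PML like ob1
 * round-robins message fragments across the BTL modules that report
 * the target peer as reachable. */
struct btl_module {
    const char *name;   /* e.g. "tcp:eth0", "tcp:eth1", "openib:lid5" */
};

/* Return the index of the module to use for the next fragment,
 * advancing the rotating round-robin cursor. */
static size_t next_module(size_t *rr_cursor, size_t num_modules)
{
    size_t chosen = *rr_cursor % num_modules;
    *rr_cursor = (*rr_cursor + 1) % num_modules;
    return chosen;
}
```

With two tcp modules (two ethernet cards, as in the example above), successive large-message fragments would alternate between index 0 and index 1.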

The openib BTL can return multiple modules if multiple LIDs are
available. So the ob1/dr doesn't know that these are not physical
devices -- it just treats each module as an equivalent mechanism to
send data.
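
For background on where those extra LIDs come from: with LID Mask Control (LMC) set to c on a port, the port answers to 2^c consecutive LIDs starting at its base LID. A rough sketch of the arithmetic (again, illustrative names, not the openib BTL's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: with LID Mask Control (LMC) = lmc, a port owns
 * 2^lmc consecutive LIDs starting at its base LID.  A BTL could
 * expose one module per such LID. */
static unsigned num_lids(unsigned lmc)
{
    return 1u << lmc;
}

/* The LID for path index 'path' (0 <= path < num_lids(lmc)). */
static uint16_t lid_for_path(uint16_t base_lid, unsigned path)
{
    return (uint16_t)(base_lid + path);
}
```

So a port with LMC=2 would yield four modules, and ob1/dr would round-robin across them exactly as if they were four separate devices.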

This is actually somewhat lame as a scheme, and we talked internally
about doing something more intelligent. But we decided to hold off
and let people (like you!) with real-world apps and networks give
this stuff a try and see what really works (and what doesn't work)
before trying to implement anything else.

So -- all that explanation aside -- we'd love to hear your feedback
with regards to the multi-LID stuff in Open MPI. :-)

On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:

> Thanks for that.
> Suppose there are multiple interconnects, say ethernet and
> infiniband and a million byte of data is to be sent, then in this
> case the data will be sent through infiniband (since it's a fast path
> ... please correct me here if I'm wrong).
> If there are multiple such sends, do you mean to say that each send
> will go through different BTLs in a RR manner if they are connected
> to the same port?
> -chev
> On 12/4/06, Gleb Natapov <glebn_at_[hidden]> wrote:
>> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
>> Chevchenkovic wrote:
>>> Hi,
>>> It is not clear from the code you mentioned (ompi/mca/pml/ob1/)
>>> where exactly the selection of the BTL bound to a particular LID
>>> occurs. Could you please specify the file/function name for the
>>> same?
>> There is no such code there. OB1 knows nothing about LIDs. It does RR
>> over all available interconnects. It can do RR between ethernet, IB
>> and Myrinet for instance. The BTL presents each LID as a different
>> virtual HCA to OB1, and it does round-robin between them without
>> even knowing this is the same port of the same HCA.
>> Can you explain what you are trying to achieve?
>>> -chev
>>> On 12/4/06, Gleb Natapov <glebn_at_[hidden]> wrote:
>>>> On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic
>>>> Chevchenkovic wrote:
>>>>> Also could you please tell me which part of the openMPI code
>>>>> needs to
>>>>> be touched so that I can do some modifications in it to
>>>>> incorporate
>>>>> changes regarding LID selection...
>>>> It depends on what you want to do. The part that does RR over all
>>>> available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but the code
>>>> isn't aware of the fact that it is doing RR over different LIDs
>>>> rather than different NICs (yet?).
>>>> The code that controls what LIDs will be used is in
>>>> ompi/mca/btl/openib/btl_openib_component.c.
>>>>> On 12/4/06, Chevchenkovic Chevchenkovic
>>>>> <chevchenkovic_at_[hidden]> wrote:
>>>>>> Is it possible to control the LID where the send and recvs are
>>>>>> posted.. on either ends?
>>>>>> On 12/3/06, Gleb Natapov <glebn_at_[hidden]> wrote:
>>>>>>> On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic
>>>>>>> Chevchenkovic
>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> I had this query. I hope some expert replies to it.
>>>>>>>> I have 2 nodes connected point-to-point using infiniband
>>>>>>>> cable. There
>>>>>>>> are multiple LIDs for each of the end node ports.
>>>>>>>> When I give an MPI_Send, are the sends posted on different
>>>>>>>> LIDs on each of the end nodes, or are they posted on the same
>>>>>>>> LID?
>>>>>>>> Awaiting your reply,
>>>>>>> It depends on what version of Open MPI you are using. If you
>>>>>>> are using
>>>>>>> trunk or v1.2 beta then all available LIDs are used in RR
>>>>>>> fashion. The
>>>>>> early
>>>>>>> versions don't support LMC.
>>>>>>> --
>>>>>>> Gleb.
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> users_at_[hidden]
>>>> --
>>>> Gleb.
>> --
>> Gleb.

Jeff Squyres
Server Virtualization Business Unit
Cisco Systems