
Open MPI Development Mailing List Archives


From: Rich L. Graham (rlgraham_at_[hidden])
Date: 2005-09-03 10:41:32


Brad,

On Sep 2, 2005, at 6:17 PM, Brad Penoff wrote:

> hey Jeff/Galen,
>
> Thanks to both of you for helping answer our questions, both on and off the list. Currently, we're doing a lot of writing trying to focus on MPI implementation design strategies, so this has certainly helped us; hopefully others too.
>
> On our end, generally, we've been trying to push as much functionality down to the transport (we have some info on our webpage: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/ or you can hear me talk at SC|05), whereas your approach is to bring functionality up and manage it within the middleware (obviously you do a lot of other neat things like thread safety and countless other things that are really impressive). With respect to managing interfaces in the middleware, I understand it buys you some generality, though, since channel bonding (for TCP) and concurrent multipath transfer (for SCTP) aren't available for mVAPI, Open IB, GM, MX, etc.
>
> Already, I think it's cool to read about Open MPI's design; in the future, it will be cooler to hear whether pulling so much functionality up into the middleware has any performance drawbacks from having to do so much management (comparing, for example, a setup with two NICs using Open MPI striping to a thinner middleware with the same setup that uses channel bonding). From the looks of it, your Euro PVM/MPI paper is going to speak to the low cost of software components; I'm just curious about the cost of even having this management functionality in the middleware in the first place; time will tell!

This is our 3rd-generation-plus effort at developing pt-2-pt messaging interfaces, so we have built on close to seven years of experience developing these sorts of interfaces while supporting our ASC production systems with this source code base. This line of work (starting in what eventually became LA-MPI, which is being superseded by a far cleaner and more flexible design) began in the context of production systems that supported 13 interfaces between hosts ... I guess what I am trying to say is that the complexity resides somewhere, and we have opted to put together a system that can, if the user so chooses, use all available communications resources for a single large message. While this may seem to be a complex problem, when you are on your third or fourth design (depending on how you count), a lot has been learned (and a lot of simplification has occurred).
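
To make the striping idea concrete, here is a minimal sketch (in C) of how one large message could be split across every available interface in proportion to each interface's bandwidth. The names here (struct iface, stripe_message) are invented for illustration; this is not Open MPI's actual OB1 scheduler, just the general shape of the idea.

/* Hypothetical sketch, not Open MPI code: split one large message across
 * all available interfaces, weighting each fragment by relative bandwidth. */
#include <stddef.h>

struct iface {
    double bandwidth;   /* relative weight for this interface   */
    size_t offset;      /* start of the fragment assigned to it */
    size_t length;      /* number of bytes assigned to it       */
};

static void stripe_message(struct iface *ifaces, int n, size_t msg_len)
{
    double total = 0.0;
    size_t assigned = 0;

    for (int i = 0; i < n; i++)
        total += ifaces[i].bandwidth;

    for (int i = 0; i < n; i++) {
        size_t chunk = (i == n - 1)
            ? msg_len - assigned   /* last interface takes the remainder */
            : (size_t)(msg_len * (ifaces[i].bandwidth / total));
        ifaces[i].offset = assigned;
        ifaces[i].length = chunk;
        assigned += chunk;
    }
    /* each fragment would then be handed to its interface's send path */
}

Each interface then moves only its own byte range, so the aggregate bandwidth of all paths can be applied to a single message.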

One thing I should add: one of the nice things about the Open MPI design is that it does not at all preclude supporting both approaches, even in the same .so. As a matter of fact, what Brian did for the Euro PVM/MPI paper was take our previous version of the pt-2-pt implementation and simplify it to eliminate much of this scheduling logic.

Finally, this design is aimed at running apps on (yet to be realized) peta-scale platforms, where we don't want an app to fail just because one of many interfaces fails. In this case we do want to keep the various available interfaces independent, so that we can deal with the failures w/o killing the app ...
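
As a rough illustration of what keeping the interfaces independent buys, the sketch below marks a failed path and requeues its outstanding bytes over the survivors instead of aborting; the types and the resend hook are hypothetical, not Open MPI's actual failover code.

/* Hypothetical sketch: on an interface failure, drop the path from the
 * schedule and resubmit its in-flight fragment, rather than killing the app. */
#include <stdbool.h>
#include <stddef.h>

struct path {
    bool   failed;            /* set once this interface is declared dead */
    size_t inflight_offset;   /* fragment the path still owed             */
    size_t inflight_length;
};

/* hypothetical callback supplied by the messaging layer to requeue data */
typedef void (*resend_fn)(size_t offset, size_t length);

static void handle_path_failure(struct path *paths, int n, int bad,
                                resend_fn resend)
{
    paths[bad].failed = true;

    /* requeue whatever the dead path still owed; a fuller version would
     * also rebalance the remaining n-1 live paths */
    if (paths[bad].inflight_length > 0)
        resend(paths[bad].inflight_offset, paths[bad].inflight_length);

    (void)n;
}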

Rich

>
> Thanks again for all your answers,
>
> brad
>
>
> On Wed, 31 Aug 2005, Galen M. Shipman wrote:
>
>>
>> On Aug 31, 2005, at 1:06 PM, Jeff Squyres wrote:
>>
>>> On Aug 29, 2005, at 9:17 PM, Brad Penoff wrote:
>>>
>>>
>>>>> PML: Pretty much the same as it was described in the paper. Its interface is basically MPI semantics (i.e., it sits right under MPI_SEND and the rest).
>>>>>
>>>>> BTL: Byte Transfer Layer; it's the next generation of the PTL. The BTL is much simpler than the PTL, and removes all vestiges of any MPI semantics that still lived in the PTL. It's a very simple byte-mover layer, intended to make it quite easy to implement new network interfaces.
>>>>>
>>>>
>>>> I was curious about what you meant by the removal of MPI semantics. Do you mean it simply has no notion of tags, ranks, etc.? In other words, does it simply put the data into some sort of format so that the PML can operate on it with its own state machine?
>>>>
>>>
>>> I don't recall the details (it's been quite a while since I looked at the PTL), but there was some semblance of MPI semantics that crept down into the PTL interface itself. The BTL has none of that -- it's purely a byte mover.
>>>
>>
>> The old PTLs controlled the short vs. long rendezvous protocol, the eager transmission of data, as well as the pipelining of RDMA operations (where appropriate). In the PML OB1 and the BTLs, this has all been moved to the OB1 level. Note that this is simply a logical separation of control and comes at virtually no cost (well, there is the very small cost of using a function pointer).
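
In other words, the protocol decision lives above the byte movers, so each BTL can stay a dumb "send these bytes" interface. A minimal, hypothetical sketch of that split (all names are invented; this is not the real OB1 code):

/* Illustrative only: a PML-level layer picks eager vs. rendezvous based on
 * the interface's eager limit; the BTL itself just moves bytes. */
#include <stddef.h>

enum protocol { PROTO_EAGER, PROTO_RENDEZVOUS };

struct byte_mover {
    size_t eager_limit;                              /* per-interface limit */
    int  (*send_bytes)(const void *buf, size_t len); /* pure byte mover     */
};

static enum protocol choose_protocol(const struct byte_mover *btl, size_t len)
{
    /* short messages go out eagerly with the match header; long messages
     * use a rendezvous so the receiver can prepare buffers (and RDMA
     * transfers can be pipelined) */
    return (len <= btl->eager_limit) ? PROTO_EAGER : PROTO_RENDEZVOUS;
}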
>>
>>
>>>
>>>> Also, say you had some underlying protocol that allowed unordered delivery of data (so not fully ordered like TCP); which "layer" would the notion of "order" be handled in? I'm guessing the PML would need some sort of sequence number attached to it; is that right?
>>>>
>>>
>>> Correct. That was in the PML in the 2nd-gen stuff and is still in the PML in the 3rd-gen stuff.
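
A sketch of what such PML-level sequencing over an unordered transport might look like (the window size, types, and names below are invented for illustration, not Open MPI's actual implementation):

/* Hypothetical sketch: each fragment carries a sequence number; fragments
 * that arrive early are parked until the gap before them is filled. */
#include <stddef.h>
#include <stdint.h>

#define WINDOW 64   /* assumed maximum number of outstanding fragments */

struct reorder_buf {
    uint16_t    next_expected;     /* next sequence number to deliver          */
    const void *pending[WINDOW];   /* early arrivals parked here (NULL = empty) */
};

static void on_fragment(struct reorder_buf *rb, uint16_t seq,
                        const void *frag, void (*deliver)(const void *))
{
    if (seq == rb->next_expected) {
        deliver(frag);
        rb->next_expected++;
        /* drain anything that was waiting on this fragment */
        while (rb->pending[rb->next_expected % WINDOW] != NULL) {
            deliver(rb->pending[rb->next_expected % WINDOW]);
            rb->pending[rb->next_expected % WINDOW] = NULL;
            rb->next_expected++;
        }
    } else {
        rb->pending[seq % WINDOW] = frag;   /* arrived early; park it */
    }
}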
>>>
>>>
>>>>> BML: BTL Management Layer; this used to be part of the PML, but we recently split it off into its own framework. It's mainly the utility gorp of managing multiple BTL modules in a single process. This was done because, when working with the next generation of collectives, MPI-2 I/O, and MPI-2 one-sided operations, we want to have the ability to use the PML (which the collectives do today, for example) or to be able to dive right down and directly use the BTLs (i.e., cut out a little latency).
>>>>>
>>>>
>>>> In the cases where the BML is required, does it cost extra memcpy's?
>>>>
>>>
>>> Not to my knowledge. Galen -- can you fill in the details of this
>>> question and the rest of Brad's questions?
>>>
>> The BML layer is simply a management layer for discovering peer resources. It does mask the BTL send, put, prepare_src, and prepare_dst operations, but this code is all very short and declared inline, so gcc should inline it appropriately. In fact, this inlined code used to be in the PML OB1 before we added the BML, so it is a no-cost "logical" abstraction. We don't add any extra memory copies in this abstraction.
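
As a rough picture of that kind of thin, inlined dispatch (not the real BML API; the structures and the round-robin choice are invented for the example):

/* Hypothetical sketch: the management layer just picks a BTL module for a
 * peer and forwards the call through a function pointer; the wrapper is
 * declared inline, so there is no extra copy, only the indirect call. */
#include <stddef.h>

struct btl_module {
    int (*send)(struct btl_module *btl, const void *buf, size_t len);
};

struct peer_endpoints {
    struct btl_module **btls;   /* BTL modules that can reach this peer */
    int                 count;
    int                 next;   /* simple round-robin cursor            */
};

static inline int bml_send(struct peer_endpoints *peer,
                           const void *buf, size_t len)
{
    struct btl_module *btl = peer->btls[peer->next];
    peer->next = (peer->next + 1) % peer->count;
    return btl->send(btl, buf, len);   /* no memcpy, just an indirect call */
}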
>>
>>> Thanks!
>>>
>>> --
>>> {+} Jeff Squyres
>>> {+} The Open MPI Project
>>> {+} http://www.open-mpi.org/
>>>
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel