Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IB pow wow notes
From: Richard Graham (rlgraham_at_[hidden])
Date: 2007-12-02 17:11:18


One question -- there is mention of a new PML that is essentially CM+matching.
Why is this not just another instance of CM?

Rich

On 11/26/07 7:54 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> OMPI OF Pow Wow Notes
> 26 Nov 2007
>
> ---------------------------------------------------------------------------
>
> Discussion of current / upcoming work:
>
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it. Will commit it to a
> tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
> on that next week.
>
> Cisco:
> - likely to get more involved in PML/BTL issues since Galen + Brian
> now transferring out of these areas.
> - race between Chelsio / Cisco as to who implements RDMA CM connect PC
> first (more on this below). May involve some changes to the connect
> PC interface.
> - Re-working libevent and progress engine stuff with George
>
> LLNL:
> - Andrew Friedley leaving LLNL in 3 weeks
> - UD code is more or less functional, but working on reliability stuff
> down in the BTL. That part is not quite working yet.
> - When he leaves LLNL, UD BTL may become unmaintained.
>
> IBM:
> - Has an interest in NUNAs
> - May be interested in maintaining the UD BTL; worried about scale.
>
> Mellanox:
> - Just finished first implementation of XRC
> - Now working on QA issues with XRC: testing with multiple subnets,
> different numbers of HCAs/ports on different hosts, etc.
>
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
> compatibility issues before the stack becomes final.
>
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get progress thread working for best IB overlap and
> real application performance.
>
> Voltaire:
> - Working on XRC improvements
> - Working on message coalescing. Only sees benefit if you drastically
> reduce the number of available buffers and credits -- i.e., be much
> more like openib BTL before BSRQ (2 buffer sizes: large and small,
> and have very few small buffer credits).
> <lots of discussion about message coalescing; this will be worth at
> least an FAQ item to explain all the trade-offs. There could be
> non-artificial benefits for coalescing at scale because of limiting
> the number of credits>
> - Moving HCA initializing stuff to a common area to share it with
> collective components.
>
> ---------------------------------------------------------------------------
>
> Discussion of various "moving forward" proposals:
>
> - ORNL, Cisco, Mellanox discussing how to reduce cost of memory
> registration. Currently running some benchmarks to figure out where
> the bottlenecks are. Cheap registration would *help* (but not
> completely solve) overlap issues by reducing the number of sync
> points -- e.g., always just do one big RDMA operation (vs. the
> pipeline protocol).
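>
> A minimal sketch of the registration call whose cost is being
> benchmarked (standard verbs API; pd/buf/len are placeholders):
>
>     #include <stddef.h>
>     #include <infiniband/verbs.h>
>
>     /* Register one large region so the whole transfer can be a single
>        RDMA operation; ibv_reg_mr() / ibv_dereg_mr() are the expensive
>        steps being measured here. */
>     static struct ibv_mr *register_whole_buffer(struct ibv_pd *pd,
>                                                 void *buf, size_t len)
>     {
>         /* A NULL return means registration failed; the BTL would then
>            fall back to copy-in/copy-out or the pipeline protocol. */
>         return ibv_reg_mr(pd, buf, len,
>                           IBV_ACCESS_LOCAL_WRITE |
>                           IBV_ACCESS_REMOTE_WRITE |
>                           IBV_ACCESS_REMOTE_READ);
>     }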
>
> - Some discussion of a UD-based connect PC. Gleb mentions that what
> was proposed is effectively the same as the IBTA CM (i.e., ibcm).
> Jeff will go investigate.
>
> - Gleb also mentions that the connect PC needs to be based on the
> endpoint, not the entire module (for non-uniform hardware
> networks). Jeff took a to-do item to fix. Probably needs to be
> done for v1.3.
>
> - To UD or not to UD? Lots of discussion on this.
>
> - Some data has been presented by OSU showing that UD drops don't
> happen often. Their tests were run in a large non-blocking
> network. Some in the group feel that in a busy blocking network,
> UD drops will be [much] more common (there is at least some
> evidence that in a busy non-blocking network, drops *are* rare).
> This issue affects how we design the recovery of UD drops: if
> drops are rare, recovery can be arbitrarily expensive. If drops
> are not rare, recovery needs to be at least somewhat efficient.
> If drops are frequent, recovery needs to be cheap/fast/easy.
>
> - Mellanox is investigating why ibv_rc_pingpong gives better
> bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
>
> - Discuss the possibility of doing connection caching: only allow so
> many RC connections at a time. Maintain an LRU of RC connections.
> When you need to close one, also recycle (or free) all of its
> posted buffers.
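>
> A rough sketch of such a cache (hypothetical names; nothing like this
> exists in the openib BTL today):
>
>     #include <stdlib.h>
>     #include <infiniband/verbs.h>
>
>     struct rc_conn {
>         struct rc_conn *prev, *next;   /* doubly-linked LRU list */
>         struct ibv_qp  *qp;            /* the cached RC queue pair */
>     };
>
>     struct rc_cache {
>         struct rc_conn *lru_head;      /* most recently used */
>         struct rc_conn *lru_tail;      /* least recently used */
>         int             count;
>         int             max_conns;     /* "only allow so many RC connections" */
>     };
>
>     /* When a new connection is needed and the cache is full, close the
>        least recently used one and recycle (or free) its posted buffers. */
>     static void rc_cache_evict_lru(struct rc_cache *c)
>     {
>         struct rc_conn *victim = c->lru_tail;
>         if (NULL == victim) return;
>         c->lru_tail = victim->prev;
>         if (NULL != c->lru_tail) c->lru_tail->next = NULL;
>         else                     c->lru_head = NULL;
>         ibv_destroy_qp(victim->qp);    /* posted buffers go back to the free list here */
>         free(victim);
>         c->count--;
>     }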
>
> - Discussion of MVAPICH technique for large UD messages: "[receiver]
> zero copy UD". Send a match header; receiver picks a UD QP from a
> ready pool and sends it back to the sender. Fragments from the
> user's buffer are posted to that QP on the receiver, so the sender
> sends straight into the receiver's target buffer. This scheme
> assumes no drops. For OMPI, this scheme also requires more
> complexity from our current multi-device striping method: we'd
> want to stripe across large contiguous portions of the message
> (vs. round robining small fragments from the message).
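>
> A rough outline of the handshake, with hypothetical message types
> (only a sketch of the described scheme, not MVAPICH or OMPI code):
>
>     #include <stdint.h>
>
>     enum zcopy_ud_msg {
>         ZCOPY_UD_MATCH_HDR,   /* sender -> receiver: match information only */
>         ZCOPY_UD_QP_REPLY,    /* receiver -> sender: which UD QP to target */
>         ZCOPY_UD_FIN          /* sender -> receiver: all fragments sent */
>     };
>
>     struct zcopy_ud_qp_reply {
>         uint32_t qpn;         /* UD QP picked from the receiver's ready pool */
>         uint32_t qkey;
>         /* The receiver has already posted fragments of the user's target
>            buffer as receive WRs on this QP, so the sender's UD sends land
>            directly in the user's buffer -- assuming nothing is dropped. */
>     };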
>
> - One point specifically discussed: long message alltoall at scale
> (i.e., large numbers of MPI processes). Andrew Friedley is going
> to ask around LLNL if anyone does this, but our guess is no
> because each host would need a *ton* of RAM to do this:
> (num_procs_per_node * num_procs_total * length_of_buffer). Our
> suspicion is that alltoall for short messages is much more common
> (and still, by far, not the most common MPI collective).
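>
> (Plugging in hypothetical numbers: 8 procs per node, 4096 procs total,
> and a 1 MiB buffer per peer gives 8 * 4096 * 1 MiB = 32 GiB of buffer
> space per node -- which is why we doubt anyone does this.)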
>
> - One proposal:
> - Use UD for short messages (except for peers that switch to eager
> RDMA)
> - Always use RC for long messages, potentially with connection
> caching+fast IB connect (ibcm?)
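>
> In code, the proposed selection is roughly (hypothetical names and
> threshold; eager-RDMA and connection-caching details are elided):
>
>     #define SHORT_MSG_LIMIT 12288            /* placeholder eager limit */
>
>     if (len <= SHORT_MSG_LIMIT && !endpoint_uses_eager_rdma(ep)) {
>         send_over_ud(ep, buf, len);          /* unreliable datagram path */
>     } else {
>         rc_conn_t *conn = rc_connect_fast(ep);   /* cached RC / ibcm */
>         send_over_rc(conn, buf, len);        /* reliable connection path */
>     }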
>
> - Another proposal: let OSU keep forging ahead with UD and see what
> they come up with. I.e., let them figure out if UD is worth it or
> not.
>
> - End result: it's not 100% clear that UD is a "win" yet -- there
> are many unanswered questions.
>
> - Make a new PML that is essentially "CM+matching": send entire messages
> down to the lower layer instead of having the PML do the fragmenting
> (a rough interface sketch follows this item):
>
> - Rationale:
> - pretty simple PML
> - allow lower layer to do more optimizations based on full
> knowledge of the specific network being used
> - networks get CM-like benefits without having to "natively"
> support shmem (because matching will still be done in the PML
> and there will be a lower layer/component for shmem)
> - [possibly] remove some stuff from current code in ob1 that is
> not necessary in IB/OF (Gleb didn't think that this would be
> useful; most of OB1 is there to support IB/OF)
> - not force other networks to the same model as IB/OF (i.e., when we
> want new things in IB/OF, we have to go change all the other BTLs)
> --> ^^ I forgot to mention this point on the call today
> - if we go towards a combined RC+UD OF protocol, the current OB1
> is not good at this because the BTL flags will have to "lie"
> about whether a given endpoint is capable of RDMA or not.
> --> Gleb mentioned that it doesn't matter what the PML thinks;
> even if the PML tells the BTL to RDMA PUT/GET, the BTL can
> emulate it if it isn't supported (e.g., if an endpoint
> switches between RC and UD)
>
> - Jeff sees this as a code re-org, not so much as a re-write.
>
> - Gleb is skeptical on the value of this; it may be more valuable if
> we go towards a blended UD+RC protocol, though.
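>
> To make the proposal concrete, the downcall might look something like
> this (purely hypothetical interface -- nothing like it exists yet):
>
>     /* The PML keeps MPI matching; the lower layer gets the *whole*
>        message and does its own fragmentation and protocol selection
>        (UD, RC, eager RDMA, shmem, ...). */
>     int lower_layer_isend(struct lower_layer_module *mod,
>                           const void *buf, size_t len,   /* entire message */
>                           int peer, int tag,             /* match info only */
>                           void (*complete_cb)(void *ctx), void *ctx);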
>
> The phone bridge started kicking people off at this point (after we
> went 30+ minutes beyond the scheduled end time). So no conclusions
> were reached. This discussion probably needs to continue in e-mail,
> etc.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>