Open MPI Development Mailing List Archives

Subject: [OMPI devel] IB pow wow notes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-11-26 19:54:26

OMPI OF Pow Wow Notes
26 Nov 2007


Discussion of current / upcoming work:

OCG (Chelsio):
- Did a bunch of udapl work, but abandoned it. Will commit it to a
   tmp branch if anyone cares (likely not).
- They have been directed to move to the verbs API; will be starting
   on that next week.

- likely to get more involved in PML/BTL issues since Galen + Brian
   now transferring out of these areas.
- race between Chelsio / Cisco as to who implements RDMA CM connect PC
   first (more on this below). May involve some changes to the connect
   PC interface.
- Re-working libevent and progress engine stuff with George

- Andrew Friedley leaving LLNL in 3 weeks
- UD code more or less functional, but working on reliability stuff
   down in the BTL. That part is not quite working yet.
- When he leaves LLNL, UD BTL may become unmaintained.

- Has an interest in NUNAs
- May be interested in maintaining the UD BTL; worried about scale.

- Just finished first implementation of XRC
- Now working on QA issues with XRC: testing with multiple subnets,
   different numbers of HCAs/ports on different hosts, etc.

- Currently working full steam ahead on UDAPL.
- Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
- Will *hopefully* get early access to Sun's verbs stack for testing /
   compatibility issues before the stack becomes final.

- Mostly working on non-PML/BTL issues these days.
- Galen's advice: get progress thread working for best IB overlap and
   real application performance.

- Working on XRC improvements
- Working on message coalescing. Only sees benefit if you drastically
   reduce the number of available buffers and credits -- i.e., be much
   more like openib BTL before BSRQ (2 buffer sizes: large and small,
   and have very few small buffer credits).
   <lots of discussion about message coalescing; this will be worth at
   least an FAQ item to explain all the trade-offs. There could be
   non-artificial benefits for coalescing at scale because of limiting
   the number of credits>
- Moving HCA initializing stuff to a common area to share it with
   collective components.
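
A toy model of the coalescing trade-off noted above (benefit only shows
up when credits are scarce). This is a sketch under assumptions, not the
openib BTL code: CoalescingSender, count_wire_sends, the 64-byte buffer
and the one-credit-per-send rule are all hypothetical names and
parameters chosen for illustration.

```python
from collections import deque


class CoalescingSender:
    """Toy sender: every wire send costs one credit (hypothetical model)."""

    def __init__(self, credits, buffer_size):
        self.credits = credits
        self.buffer_size = buffer_size
        self.pending = deque()  # messages waiting for a credit
        self.wire_sends = []    # one entry (a list of messages) per wire send

    def send(self, msg):
        if len(msg) > self.buffer_size:
            raise ValueError("message does not fit in one buffer")
        self.pending.append(msg)
        self._drain()

    def credit_returned(self, n=1):
        self.credits += n
        self._drain()

    def _drain(self):
        while self.credits > 0 and self.pending:
            batch, used = [], 0
            # Pack as many queued messages as fit into a single buffer.
            while self.pending and used + len(self.pending[0]) <= self.buffer_size:
                msg = self.pending.popleft()
                batch.append(msg)
                used += len(msg)
            self.credits -= 1
            self.wire_sends.append(batch)


def count_wire_sends(credits, n_msgs=8, msg_len=8, buffer_size=64):
    """Send n_msgs back to back, then let the peer return credits one by one."""
    sender = CoalescingSender(credits, buffer_size)
    for _ in range(n_msgs):
        sender.send(b"x" * msg_len)
    while sender.pending:
        sender.credit_returned()
    return len(sender.wire_sends)
```

In this model, count_wire_sends(8) yields 8 separate sends because
nothing ever queues, while count_wire_sends(1) yields 2: one immediate
send plus one coalesced buffer holding the other 7 messages.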


Discussion of various "moving forward" proposals:

- ORNL, Cisco, Mellanox discussing how to reduce cost of memory
   registration. Currently running some benchmarks to figure out where
   the bottlenecks are. Cheap registration would *help* (but not
   completely solve) overlap issues by reducing the number of sync
   points -- e.g., always just do one big RDMA operation (vs. the
   pipeline protocol).

- Some discussion of a UD-based connect PC. Gleb mentions that what
   was proposed is effectively the same as the IBTA CM (i.e., ibcm).
   Jeff will go investigate.

- Gleb also mentions that the connect PC needs to be based on the
   endpoint, not the entire module (for non-uniform hardware
   networks). Jeff took a to-do item to fix. Probably needs to be
   done for v1.3.

- To UD or not to UD? Lots of discussion on this.

   - Some data has been presented by OSU showing that UD drops don't
     happen often. Their tests were run in a large non-blocking
     network. Some in the group feel that in a busy blocking network,
     UD drops will be [much] more common (there is at least some
     evidence that in a busy non-blocking network, drops *are* rare).
     This issue affects how we design the recovery of UD drops: if
     drops are rare, recovery can be arbitrarily expensive. If drops
     are not rare, recovery needs to be at least somewhat efficient.
     If drops are frequent, recovery needs to be cheap/fast/easy.

   - Mellanox is investigating why ibv_rc_pingpong gives better
     bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).

   - Discuss the possibility of doing connection caching: only allow so
     many RC connections at a time. Maintain an LRU of RC connections.
     When you need to close one, also recycle (or free) all of its
     posted buffers.

   - Discussion of MVAPICH technique for large UD messages: "[receiver]
     zero copy UD". Send a match header; receiver picks a UD QP from a
     ready pool and sends it back to the sender. Fragments from the
     user's buffer are posted to that QP on the receiver, so the sender
     sends straight into the receiver's target buffer. This scheme
     assumes no drops. For OMPI, this scheme also requires more
     complexity from our current multi-device striping method: we'd
     want to stripe across large contiguous portions of the message
     (vs. round robining small fragments from the message).

   - One point specifically discussed: long message alltoall at scale
     (i.e., large numbers of MPI processes). Andrew Friedley is going
     to ask around LLNL if anyone does this, but our guess is no
     because each host would need a *ton* of RAM to do this:
     (num_procs_per_node * num_procs_total * length_of_buffer). Our
     suspicion is that alltoall for short messages is much more common
     (and still, by far, not the most common MPI collective).

   - One proposal:
     - Use UD for short messages (except for peers that switch to eager
     - Always use RC for long messages, potentially with connection
       caching+fast IB connect (ibcm?)

   - Another proposal: let OSU keep forging ahead with UD and see what
     they come up with. I.e., let them figure out if UD is worth it or not.

   - End result: it's not 100% clear that UD is a "win" yet -- there
     are many unanswered questions.
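
The connection caching idea from the UD discussion above (cap the number
of RC connections, keep them in LRU order, release posted buffers on
eviction) can be sketched roughly as follows. This is only an
illustration: RcConnectionCache and its methods are hypothetical, and
plain bytearrays stand in for real QP setup and posted receive buffers.

```python
from collections import OrderedDict


class RcConnectionCache:
    """Hypothetical LRU cache of RC connections, keyed by peer rank."""

    def __init__(self, max_connections, buffers_per_connection=2):
        if max_connections < 1:
            raise ValueError("need room for at least one connection")
        self.max_connections = max_connections
        self.buffers_per_connection = buffers_per_connection
        self._conns = OrderedDict()  # peer -> posted buffers, least recent first
        self.evicted = []            # peers whose connections were closed

    def get(self, peer):
        """Return the posted buffers for `peer`, connecting if necessary."""
        if peer in self._conns:
            self._conns.move_to_end(peer)  # mark as most recently used
            return self._conns[peer]
        if len(self._conns) >= self.max_connections:
            victim, buffers = self._conns.popitem(last=False)
            buffers.clear()  # stand-in for recycling/freeing posted buffers
            self.evicted.append(victim)
        # Stand-in for RC connection setup plus posting receive buffers.
        self._conns[peer] = [bytearray(64) for _ in range(self.buffers_per_connection)]
        return self._conns[peer]

    def peers(self):
        return list(self._conns)
```

For example, with max_connections=2, touching peers 0, 1, 0, 2 in that
order closes the connection to peer 1 and leaves peers 0 and 2 cached.
How expensive the real reconnect is (e.g., via a fast IB connect
mechanism) is exactly the open question; this sketch ignores that cost.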

- Make new PML that is essentially "CM+matching", send entire messages
   down to lower layer instead of having the PML do the fragmenting:

   - Rationale:
     - pretty simple PML
     - allow lower layer to do more optimizations based on full
       knowledge of the specific network being used
     - networks get CM-like benefits without having to "natively"
       support shmem (because matching will still be done in the PML
       and there will be a lower layer/component for shmem)
     - [possibly] remove some stuff from current code in ob1 that is
       not necessary in IB/OF (Gleb didn't think that this would be
       useful; most of OB1 is there to support IB/OF)
     - not force other networks to same model as IB/OF (i.e., when we
       add new things in IB/OF, we have to go change all the other BTLs)
       --> ^^ I forgot to mention this point on the call today
     - if we go towards a combined RC+UD OF protocol, the current OB1
       is not good at this because the BTL flags will have to "lie"
       about whether a given endpoint is capable of RDMA or not.
       --> Gleb mentioned that it doesn't matter what the PML thinks;
           even if the PML tells the BTL to RDMA PUT/GET, the BTL can
           emulate it if it isn't supported (e.g., if an endpoint
           switches between RC and UD)

   - Jeff sees this as a code re-org, not so much as a re-write.

   - Gleb is skeptical on the value of this; it may be more valuable if
     we go towards a blended UD+RC protocol, though.
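
Gleb's point above (the upper layer can always ask for a PUT; the lower
layer emulates it when the endpoint cannot do RDMA) might look like the
following toy sketch. Everything here is hypothetical: ToyEndpoint,
upper_layer_transfer, and frag_size are illustrative names, not OMPI
PML/BTL interfaces, and a local bytearray stands in for the peer's
target memory.

```python
class ToyEndpoint:
    """Hypothetical endpoint; `peer_buffer` stands in for remote target memory."""

    def __init__(self, supports_rdma, peer_buffer, frag_size=4):
        self.supports_rdma = supports_rdma
        self.peer_buffer = peer_buffer
        self.frag_size = frag_size
        self.operations = []  # record of what went "on the wire"

    def put(self, data, remote_offset):
        """Same call for the upper layer whether or not RDMA is available."""
        if self.supports_rdma:
            self.operations.append(("rdma_write", len(data)))
            self.peer_buffer[remote_offset:remote_offset + len(data)] = data
            return
        # Emulation: ordinary send fragments; the receiver copies each into place.
        for start in range(0, len(data), self.frag_size):
            frag = data[start:start + self.frag_size]
            self.operations.append(("send_frag", len(frag)))
            self._receiver_copy(remote_offset + start, frag)

    def _receiver_copy(self, offset, frag):
        self.peer_buffer[offset:offset + len(frag)] = frag


def upper_layer_transfer(endpoint, payload):
    """The caller does not check endpoint capabilities before asking for a PUT."""
    endpoint.put(payload, remote_offset=0)
    return bytes(endpoint.peer_buffer[:len(payload)])
```

Both kinds of endpoint deliver the same bytes; only the recorded
operations differ (one rdma_write vs. several send_frag entries). The
extra receiver-side copy in the emulated path is the cost that the
"flags have to lie" concern is really about.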

The phone bridge started kicking people off at this point (after we
went 30+ minutes beyond the scheduled end time). So no conclusions
were reached. This discussion probably needs to continue in e-mail.

Jeff Squyres
Cisco Systems