
Open MPI Development Mailing List Archives


Subject: [OMPI devel] Open MPI BTL meeting in Knoxville
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-03-05 10:33:08


Sorry it took so long to forward these notes to everyone. Here are some notes from the BTL meeting we had in Knoxville a few weeks ago.

> Date:
> Feb. 12, 2013
>
> People:
> Thomas Herault
> George Bosilca
> Jeff Squyres
> Brian Barrett
> Aurelien Bouteiller
> Ralph Castain
> Nathan Hjelm
>
> Goal:
> Lay out the general design of moving the BTL framework into OPAL.
>
>
> -== Identifying dependencies ==-
>
> BTL
> +------> Modex
> +------> Mpool + rcache + conv
> +------> bml / allocators
> +------> Help/*
> +------> Naming + Endpoints?
> +------> (RML/OOB)
> +------> Threads
>
>
> ==== ACTION PLAN ====
>
> 0. Remove Solaris Threads (--with-thread option is attached)
> 1. Opal DB/modex
> 1.b OpenIB UDCM independent from OOB
> 2. Move BTL down to OPAL
> 3. Move locks to the lowercase versions (that are always locking), look at perf.
> 4. Look at conditions, atomics, etc
> 4.5: add big locks on things that are maybe not thread safe and not performance critical
> 5. Fix perf/redesign locking (in SM, in particular)
> 6. Use BTL tcp in place of OOB in ORTE
>
>
> ==== DETAIL OF ISSUES ====
>
> -== IB BTL bootstrapping ==-
>
> IB BTL is the only one that depends on OOB/RML
> Options:
> 1. Use the TCP BTL to bootstrap the IB BTL. Brian doesn't like this,
> because making it available is an enabler for bad practice that will
> creep into the codebase.
> 2. Remove OOB, Fix UDCM so it stops doing things it should not have
> done anyway.
> We settle for option 2.
>
> -== SM initialization ==-
>
> Some technical discussion on the way the shared segment is created and
> the sync mechanism for the shared file. There are a number of issues
> that seem to benefit from the fact that the modex synchronizes before
> we attempt the file access. There may be trouble if the modex is
> removed (or is not synchronizing).
>
> -== Process name scalability ==-
>
> Process names use a lot of space.
> Do we need the process information from everybody at all times?
>
> Modex vs opal_db. (need to clarify, I was doing something else)
>
> Too many things are going into the modex/db. On many architectures, we
> don't need the hostname, or other info, because they can be derived. On
> some other machines, the hostname has no meaning.
>
> Brian: BTLs should not have the hostname - at all - ? BTLs should not
> report errors themselves; errors should go up and the BTLs stay silent
> (this also avoids some massive multinode error logs).
> * error is reported upstack (no printf)
> * Callback to get the error string later, when the pretty print happens
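
A minimal sketch of that error path, assuming a BTL that returns a code to its caller and exposes a callback for the string. All names here (demo_btl_err_t, demo_btl_module_t, demo_btl_send) are hypothetical and are not Open MPI's actual API:

```c
/* Hypothetical error codes for illustration; not Open MPI's real constants. */
typedef enum {
    DEMO_BTL_SUCCESS = 0,
    DEMO_BTL_ERR_UNREACHABLE,
    DEMO_BTL_ERR_OUT_OF_RESOURCE
} demo_btl_err_t;

/* Hypothetical module: the BTL exposes a callback that maps a code to text. */
typedef struct demo_btl_module {
    const char *(*err_string)(demo_btl_err_t code);
} demo_btl_module_t;

/* Called later by the upper layer, when the pretty print happens. */
const char *demo_err_string(demo_btl_err_t code)
{
    switch (code) {
    case DEMO_BTL_SUCCESS:             return "success";
    case DEMO_BTL_ERR_UNREACHABLE:     return "peer unreachable";
    case DEMO_BTL_ERR_OUT_OF_RESOURCE: return "out of resources";
    }
    return "unknown error";
}

demo_btl_module_t demo_btl = { demo_err_string };

/* The send path prints nothing: it only hands a code back up the stack. */
demo_btl_err_t demo_btl_send(demo_btl_module_t *btl, int peer_reachable)
{
    (void) btl;  /* unused in this sketch */
    return peer_reachable ? DEMO_BTL_SUCCESS : DEMO_BTL_ERR_UNREACHABLE;
}
```

The upper layer decides if and when to turn the code into a message, so a failure seen on many nodes can be reported once instead of once per BTL.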
>
> ***
> We need a temporary name during bootstrapping (before we get the OMPI
> names set up). It could be created from a 128-bit hash; it should have a
> low probability of collision, and we can crash the job if we detect a
> collision (?).
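
A rough sketch of such a temporary name, assuming it is derived from some per-process seed string. The notes do not say which hash to use; this sketch simply runs 64-bit FNV-1a twice with different starting values to fill 128 bits, and a real implementation would probably want a stronger hash or random bits. The names (demo_tmp_name_t and friends) are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit temporary name, stored as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } demo_tmp_name_t;

/* 64-bit FNV-1a over a NUL-terminated string, starting from state h. */
static uint64_t demo_fnv1a64(const char *s, uint64_t h)
{
    for (; *s != '\0'; ++s) {
        h ^= (unsigned char) *s;
        h *= UINT64_C(0x100000001b3);   /* FNV 64-bit prime */
    }
    return h;
}

/* Build a name from a seed string (for example "host:pid", an assumption). */
demo_tmp_name_t demo_tmp_name_make(const char *seed)
{
    demo_tmp_name_t n;
    n.hi = demo_fnv1a64(seed, UINT64_C(0xcbf29ce484222325)); /* FNV offset basis */
    n.lo = demo_fnv1a64(seed, UINT64_C(0x9e3779b97f4a7c15)); /* arbitrary second start */
    return n;
}

int demo_tmp_name_equal(demo_tmp_name_t a, demo_tmp_name_t b)
{
    return a.hi == b.hi && a.lo == b.lo;
}

/* Returns 1 if any two names in the array are identical, 0 otherwise.
 * Quadratic scan: fine for a sketch, too slow for a large job. */
int demo_tmp_name_collision(const demo_tmp_name_t *names, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        for (size_t j = i + 1; j < count; ++j) {
            if (demo_tmp_name_equal(names[i], names[j])) {
                return 1;
            }
        }
    }
    return 0;
}
```

If demo_tmp_name_collision() ever reports a collision, the job would abort, as suggested above.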
>
> ompi_name is some sort of proxy for orte_name. Every time we use an
> ompi_name, it gets converted to orte_name immediately after.
>
> We also need an identifier to prevent random stuff from connecting to us.
> There is an issue with dynamic processes, because names can still be
> unknown at that point. DPM is expensive and saved by a modex, we'll have
> a problem later on if we make it fast.
>
> ***
> When does the BTL first need a name?
>
> opal_init, opal_init_util (from ompi_info)
> add? opal_init_btl(name) + opal_fini_btl()
>
> During opal_init_btl:
> * BTLs register with the name
> * Add their local info to the DB (opal_db)
> use a hashtable for storing name{ key=value ... }
> Align the values to 64 bits, so that all keys are allocated (and sent)
> in a single bulk.
> * Should some modex keys appear as global shared, local shared, local,
> (lazy propagated?)
> *
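
One way the "single bulk" idea could look, as a sketch only: every key and value is padded to an 8-byte boundary and appended to one contiguous buffer that can be allocated and sent in one piece. The layout (two 32-bit lengths, then padded key, then padded value) and all names are assumptions made for illustration; this is not the opal_db format:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key/value pair as a BTL might register it. */
typedef struct {
    const char *key;        /* NUL-terminated key */
    const void *value;      /* opaque value bytes */
    uint32_t    value_len;  /* number of value bytes */
} demo_kv_t;

/* Round n up to the next multiple of 8 bytes (64 bits). */
size_t demo_align8(size_t n)
{
    return (n + 7u) & ~(size_t) 7u;
}

/* Bytes needed to pack all pairs into one buffer. */
size_t demo_packed_size(const demo_kv_t *kv, size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += 8;                                    /* two uint32_t lengths */
        total += demo_align8(strlen(kv[i].key) + 1);   /* key incl. NUL, padded */
        total += demo_align8(kv[i].value_len);         /* value, padded */
    }
    return total;
}

/* Pack all pairs into buf (at least demo_packed_size() bytes).
 * Returns the number of bytes written; padding bytes are zeroed. */
size_t demo_pack(const demo_kv_t *kv, size_t count, unsigned char *buf)
{
    size_t off = 0;
    for (size_t i = 0; i < count; ++i) {
        uint32_t key_len = (uint32_t) strlen(kv[i].key) + 1;
        uint32_t val_len = kv[i].value_len;
        size_t key_pad = demo_align8(key_len);
        size_t val_pad = demo_align8(val_len);

        memcpy(buf + off, &key_len, sizeof(key_len));
        memcpy(buf + off + 4, &val_len, sizeof(val_len));
        off += 8;

        memset(buf + off, 0, key_pad);
        memcpy(buf + off, kv[i].key, key_len);
        off += key_pad;

        memset(buf + off, 0, val_pad);
        if (val_len > 0) {
            memcpy(buf + off, kv[i].value, val_len);
        }
        off += val_pad;
    }
    return off;
}
```

Because every field starts on an 8-byte boundary, the receiver can walk the buffer without unaligned 64-bit loads. The lengths are written in host byte order here, which a heterogeneous job would have to handle.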
>
> -== BML ==-
>
> We should not care about it and not move it around. We are fine using
> BTL only, the BML offers little functionality. We'll try, but if it
> is hard we'll forget it.
>
> * Addprocs: the assumption is that we have to call addprocs for each
> endpoint. Maybe we can change this so that addprocs is called only once.
> * If ORTE uses the BTL, it will have to be called twice, which is
> unfortunate (or not ?). It can be postponed until ORTE moves to the BTL.
>
> -== Active Message TAG numbers ==-
>
> They have to move down too. The split in 32bits groups makes the tags
> sparse. We have a layer separation break here, but we may not want to
> have all PML_OB1 tags appear down in OPAL. We'll move the header file
> down, but we won't change it for now (we are not overcrowded, so it's OK
> for it to be sparse). George will rearrange so that there are more
> possible families (at the expense of the number of possible tags per
> family).
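
The families-versus-tags trade-off can be sketched as a bit split. This assumes an 8-bit tag and reads the current split as groups of 32 tags (3 family bits, 5 tag bits); both points are assumptions, as are all names below:

```c
#include <stdint.h>

/* Assumption for this sketch: a tag is 8 bits wide. */
#define DEMO_TAG_BITS 8u

/* Hypothetical layout: how many of the high bits select the family. */
typedef struct { unsigned family_bits; } demo_tag_layout_t;

unsigned demo_num_families(demo_tag_layout_t l)
{
    return 1u << l.family_bits;
}

unsigned demo_tags_per_family(demo_tag_layout_t l)
{
    return 1u << (DEMO_TAG_BITS - l.family_bits);
}

/* Compose a tag: family in the high bits, per-family index in the low bits. */
uint8_t demo_make_tag(demo_tag_layout_t l, unsigned family, unsigned index)
{
    unsigned shift = DEMO_TAG_BITS - l.family_bits;
    return (uint8_t) ((family << shift) | (index & ((1u << shift) - 1u)));
}

/* Recover the family from a tag. */
unsigned demo_tag_family(demo_tag_layout_t l, uint8_t tag)
{
    return (unsigned) tag >> (DEMO_TAG_BITS - l.family_bits);
}
```

With 3 family bits this gives 8 families of 32 tags; moving one bit over gives 16 families of 16 tags, which is the kind of rearrangement described above.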
>
> -== Thread safety ==-
>
> Because BTLs are now being used by upper layers whose behavior we don't
> know, we have to assume that threads are on (by default). That leads to
> a bad performance hit, due to the use of opal_list_t objects that are
> locked deep inside.
>
> What needs to be protected ?
> * per btl locking: huge cost on everything
> * per endpoint locking?
>
> That's related to enabling async progress, but that is a big chunk of
> work. We just want to keep in mind that goal so that we don't make it
> worse than it already is.
>
> Do we want the ability to turn thread safety off/on at runtime?
> * lock, unlock, trylock:
> * accessors that are always safe
> * accessors that can be turned unsafe at runtime (only for OMPI level)
> * swap, cmpswap, subtract, add (32, 64bit atomics): no change (we already have both), but the CAPS move to OMPI
> * signal and condition variables
> * When and how do we call progress when needed if we remove the
> UPPER_CASE that calls it ?
> * appears in wait, test, free_list
> * We'll try to remove as many as we can (upper case locks) and see
> where we are in 4-5 months from now.
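
The two flavors of accessor discussed above can be sketched as follows. A C11 atomic_flag spinlock stands in for the real mutex type, and every name (demo_lock, DEMO_LOCK, demo_threads_enabled) is hypothetical; the only point is the split between lowercase calls that always lock and upper-case macros that a runtime switch can turn off:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in lock for this sketch: a C11 atomic_flag spinlock. */
typedef struct { atomic_flag held; } demo_lock_t;
#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Runtime switch; per the notes it would only exist at the OMPI level. */
bool demo_threads_enabled = true;

void demo_set_threads(bool on)
{
    demo_threads_enabled = on;
}

/* Lowercase accessors: always safe, they lock unconditionally. */
void demo_lock(demo_lock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        /* spin until the holder releases the flag */
    }
}

bool demo_trylock(demo_lock_t *l)
{
    /* true if we acquired the lock, false if it was already held */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void demo_unlock(demo_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Upper-case accessors: become no-ops when thread safety is switched off. */
#define DEMO_LOCK(l)   do { if (demo_threads_enabled) demo_lock(l); } while (0)
#define DEMO_UNLOCK(l) do { if (demo_threads_enabled) demo_unlock(l); } while (0)
```

Code that moves down to OPAL would call only the lowercase versions. The open question from the notes remains: this sketch has no progress call, so whatever the UPPER_CASE paths used to trigger would have to be invoked some other way.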
>
> ===== OTHER issues found while talking =====
>
> * DPM is slow, it needs a speedup.
>
> -== Error reporting / printf ==-
> Replace all orte_show_help with opal_show_help. Make sure that the
> symbol is no longer exposed outside ORTE, to force the update.
> * orte_show_help: deduplication happens in ORTE, and ompi_show_help calls
> back into orte_show_help. Let's get rid of it completely.

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/