Open MPI Development Mailing List Archives

This web mail archive is frozen.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: [OMPI devel] Open MPI BTL meeting in Knoxville
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-03-05 10:33:08

Sorry it took so long to forward these notes to everyone. Here are some notes from the BTL meeting we had in Knoxville a few weeks ago.

> Date:
> Feb. 12, 2013
> People:
> Thomas Herault
> George Bosilca
> Jeff Squyres
> Brian Barrett
> Aurelien Bouteiller
> Ralph Castain
> Nathan Hjelm
> Goal:
> Lay out the general design of moving the BTL framework into OPAL.
> -== Identifying dependencies ==-
> +------> Modex
> +------> Mpool + rcache + conv
> +------> bml / allocators
> +------> Help/*
> +------> Naming + Endpoints?
> +------> (RML/OOB)
> +------> Threads
> ==== ACTION PLAN ====
> 0. Remove Solaris Threads (--with-thread option is attached)
> 1. Opal DB/modex
> 1.b OpenIB UDCM independent from OOB
> 2. Move BTL down to OPAL
> 3. Move locks to the lowercase versions (which always lock), look at perf.
> 4. Look at conditions, atomics, etc
> 4.5: add big locks on things that are maybe not thread safe and not performance critical
> 5. Fix perf/redesign locking (in SM, in particular)
> 6. Use BTL tcp in place of OOB in ORTE
> ==== DETAIL OF ISSUES ====
> -== IB BTL bootstrapping ==-
> IB BTL is the only one that depends on OOB/RML
> Options:
> 1. Use the TCP BTL to bootstrap IB BTL. Brian doesn't like this, because
> making it available is an enabler for bad practice that will creep into
> the codebase
> 2. Remove OOB, Fix UDCM so it stops doing things it should not have
> done anyway.
> We settle for option 2.
> -== SM initialization ==-
> Some technical discussion on the way the shared segment is created and
> the sync mechanism for the shared file. There are a number of issues
> that seem to benefit from the fact that the modex synchronizes before
> we attempt the file access. There may be trouble if the modex is
> removed (or is not synchronizing).
> -== Process name scalability ==-
> Process names use a lot of space.
> Do we need the process information from everybody at all times?
> Modex vs opal_db. (need to clarify, I was doing something else)
> Too many things are going into the modex/db. On many architectures, we
> don't need the hostname, or other info, because they can be derived. On
> some other machines, the hostname has no meaning.
> Brian: BTLs should not have the hostname - at all - ? BTLs should not
> report errors themselves; errors should go up and the BTLs stay silent
> (also avoids some massive multinode error logs).
> * error is reported upstack (no printf)
> * Callback to get the error string later, when the pretty print happens
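A minimal sketch of the "error goes upstack, string comes later" idea from the two bullets above. Every name here (the `sketch_` types, functions, and error codes) is hypothetical and made up for illustration; none of it is the real BTL or OPAL interface. The BTL only returns a code, and the upper layer asks for the text if and when it decides to pretty-print.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error codes; NOT the real OPAL constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_UNREACH = -1 };

/* Hypothetical per-BTL hook that turns a code into text on request. */
typedef const char *(*sketch_err_string_fn)(int code);

typedef struct {
    const char *name;                /* e.g. "tcp" */
    sketch_err_string_fn err_string; /* called only when someone pretty-prints */
} sketch_btl_t;

static const char *sketch_tcp_err_string(int code)
{
    switch (code) {
    case SKETCH_SUCCESS:     return "success";
    case SKETCH_ERR_UNREACH: return "peer unreachable";
    default:                 return "unknown error";
    }
}

/* The BTL stays silent (no printf): it only returns a code to its caller. */
static int sketch_btl_send(const sketch_btl_t *btl, int peer_reachable)
{
    (void)btl;
    return peer_reachable ? SKETCH_SUCCESS : SKETCH_ERR_UNREACH;
}

/* The upper layer formats the message, using the callback for the text. */
static int sketch_upper_report(const sketch_btl_t *btl, int code,
                               char *buf, size_t len)
{
    assert(btl != NULL && buf != NULL);
    return snprintf(buf, len, "%s: %s", btl->name, btl->err_string(code));
}
```

Because the formatting happens in one place above the BTLs, that layer can also aggregate identical errors from many nodes before printing anything.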
> ***
> We need a temporary name during bootstrapping (before we get the OMPI
> names setup). Could be created from a 128bit hash, it should have low
> probability of collision, we can crash the job if we detect collision (?).
> ompi_name is some sort of proxy for orte_name. Every time we use an
> ompi_name, it gets converted to orte_name immediately after.
> We also need an identifier to prevent random stuff from connecting to us.
> There is an issue with dynamic processes, because names may still be
> unknown. DPM is expensive and is saved by a modex; we'll have a problem
> later on if we make it fast.
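A rough sketch of the temporary 128-bit bootstrap name discussed above. The notes do not say which hash or which inputs would be used; this example assumes hostname, pid, and a caller-supplied nonce as inputs, and builds the two 64-bit halves with FNV-1a from two different starting values (the second starting value is arbitrary). All `sketch_` names are hypothetical. For an ideal 128-bit hash the birthday bound puts the collision probability among n names at roughly n^2 / 2^129; FNV-1a is not an ideal hash, so a real implementation would likely want something stronger.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit temporary name, stored as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } sketch_tmp_name_t;

/* One FNV-1a pass over a byte range, continuing from state h. */
static uint64_t sketch_fnv1a64(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;       /* 64-bit FNV prime */
    }
    return h;
}

/* Hash (host, pid, nonce) with one starting value. */
static uint64_t sketch_half(uint64_t start, const char *host,
                            uint32_t pid, uint64_t nonce)
{
    uint64_t h = sketch_fnv1a64(start, host, strlen(host));
    h = sketch_fnv1a64(h, &pid, sizeof pid);
    return sketch_fnv1a64(h, &nonce, sizeof nonce);
}

static sketch_tmp_name_t sketch_tmp_name(const char *host, uint32_t pid,
                                         uint64_t nonce)
{
    sketch_tmp_name_t n;
    assert(host != NULL);
    n.hi = sketch_half(0xcbf29ce484222325ULL, host, pid, nonce); /* FNV offset basis */
    n.lo = sketch_half(0x9e3779b97f4a7c15ULL, host, pid, nonce); /* arbitrary second start */
    return n;
}

/* Collision check: the notes suggest aborting the job if this fires
 * for two distinct processes. */
static int sketch_tmp_name_equal(sketch_tmp_name_t a, sketch_tmp_name_t b)
{
    return a.hi == b.hi && a.lo == b.lo;
}
```

The nonce is where per-process randomness would go; without it, two processes with the same host and pid (for example after pid reuse) would get the same name.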
> ***
> When does the BTL need a name first ?
> opal_init, opal_init_util (from ompi_info)
> add? opal_init_btl(name) + opal_fini_btl()
> During opal_init_btl:
> * Btls register with the name
> * Add their local info to the DB (opal_db)
> use a hashtable for storing name{ key=value ... }
> Align the values on 64 bits, so that all keys are allocated (and sent)
> in a single bulk.
> * Should some modex key appear as global shared, local shared, local,
> (lazy propagated?)
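One possible reading of the "single bulk" layout above, sketched in C. The record format (two 32-bit lengths, then the key and the value, each padded to 8 bytes) is an assumption made for this example and is not taken from opal_db; all `sketch_` names are hypothetical. It covers only the packed buffer that would be sent, with a linear lookup, not the in-memory hashtable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round n up to the next multiple of 8 bytes (64 bits). */
#define SKETCH_ALIGN8(n) (((size_t)(n) + 7u) & ~(size_t)7u)

/* Append one record at offset off:
 *   [u32 key_len][u32 val_len][key, padded to 8][value, padded to 8]
 * Returns the new offset, or 0 if the record does not fit. */
static size_t sketch_pack(unsigned char *buf, size_t cap, size_t off,
                          const char *key, const void *val, uint32_t val_len)
{
    uint32_t key_len = (uint32_t)strlen(key) + 1;   /* include the NUL */
    size_t need = 8 + SKETCH_ALIGN8(key_len) + SKETCH_ALIGN8(val_len);

    assert((off & 7u) == 0);                        /* records stay aligned */
    if (off + need > cap) {
        return 0;
    }
    memset(buf + off, 0, need);                     /* zero the padding */
    memcpy(buf + off, &key_len, 4);
    memcpy(buf + off + 4, &val_len, 4);
    memcpy(buf + off + 8, key, key_len);
    memcpy(buf + off + 8 + SKETCH_ALIGN8(key_len), val, val_len);
    return off + need;
}

/* Linear scan for key; returns a pointer to the value (at an 8-byte
 * aligned offset within buf) and its length, or NULL if absent. */
static const void *sketch_find(const unsigned char *buf, size_t used,
                               const char *key, uint32_t *val_len)
{
    size_t off = 0;
    while (off + 8 <= used) {
        uint32_t klen, vlen;
        memcpy(&klen, buf + off, 4);
        memcpy(&vlen, buf + off + 4, 4);
        const unsigned char *k = buf + off + 8;
        if (strcmp((const char *)k, key) == 0) {
            *val_len = vlen;
            return k + SKETCH_ALIGN8(klen);
        }
        off += 8 + SKETCH_ALIGN8(klen) + SKETCH_ALIGN8(vlen);
    }
    return NULL;
}
```

With this layout, the first `used` bytes of the buffer are what would go on the wire in one send, and a receiver can read 64-bit values in place provided the buffer itself is 8-byte aligned.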
> -== BML ==-
> We should not care about it and not move it around. We are fine using
> BTL only, the BML offers little functionality. We'll try, but if it
> is hard we'll forget it.
> * Addprocs: Assumption that we have to call addprocs for each
> endpoint. Maybe we can change this so that addprocs is called only once.
> * If ORTE uses BTL, it will have to be called twice, which is unfortunate
> (or not ?). It can be postponed for when ORTE moves to BTL.
> -== Active Message TAG numbers ==-
> They have to move down too. The split in 32bits groups makes the tags
> sparse. We have a layer separation break here, but we may not want to
> have all PML_OB1 tags appear down in the OPAL. We'll put down the header
> file, we don't change it for now (we are not overcrowded so it's OK for
> it to be sparse). George will rearrange so that there are more possible
> families (at the expense of the number of possible tags per family).
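To illustrate the family/tag trade-off mentioned above, here is a small sketch that assumes an 8-bit tag split into a family field and a per-family index, with groups of 32 tags per family. Both the 8-bit width and the 3/5 split are assumptions made for this example, and the `SKETCH_` names are hypothetical; this is not the actual BTL header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [ family : 3 bits ][ index : 5 bits ] in one 8-bit tag. */
#define SKETCH_TAG_BITS    8u
#define SKETCH_FAMILY_BITS 3u                                      /* 8 families */
#define SKETCH_INDEX_BITS  (SKETCH_TAG_BITS - SKETCH_FAMILY_BITS)  /* 32 tags each */
#define SKETCH_INDEX_MASK  ((1u << SKETCH_INDEX_BITS) - 1u)

/* Build a tag from a family number and an index within that family. */
static uint8_t sketch_make_tag(unsigned family, unsigned index)
{
    assert(family < (1u << SKETCH_FAMILY_BITS));
    assert(index <= SKETCH_INDEX_MASK);
    return (uint8_t)((family << SKETCH_INDEX_BITS) | index);
}

/* Recover the family (owning layer) of a tag. */
static unsigned sketch_tag_family(uint8_t tag)
{
    return (unsigned)tag >> SKETCH_INDEX_BITS;
}

/* Recover the index of a tag inside its family. */
static unsigned sketch_tag_index(uint8_t tag)
{
    return tag & SKETCH_INDEX_MASK;
}
```

Under these assumptions, raising `SKETCH_FAMILY_BITS` from 3 to 4 would give 16 families of 16 tags each, which is the kind of rearrangement (more families, fewer tags per family) the notes describe.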
> -== Thread safety ==-
> Because BTLs are now being used by upper layers whose behavior we don't
> know, we have to assume that threads are on (by default), leading to a
> bad performance hit, due to using opal_list_t objects that are locked
> deep inside.
> What needs to be protected ?
> * per btl locking: huge cost on everything
> * per endpoint locking?
> That's related to enabling async progress, but that is a big chunk of
> work. We just want to keep in mind that goal so that we don't make it
> worse than it already is.
> Do we want the ability to turn off/on thread safety at runtime ?
> * lock, unlock, trylock:
> * accessors that are always safe
> * accessors that can be turned unsafe at runtime (only for OMPI level)
> * swap, cmpswap, subtract, add (32, 64bit atomics): no change (we already have both), but the CAPS move to OMPI
> * signal and condition variables
> * When and how do we call progress when needed if we remove the
> UPPER_CASE versions that call it ?
> * appears in wait, test, free_list
> * We'll try to remove as many as we can (upper case locks) and see
> where we are 4-5 months from now.
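The two kinds of accessors listed above can be sketched with POSIX mutexes as follows. All names are hypothetical (`sketch_` / `SKETCH_` prefixes) and the runtime flag is an assumption for the example; this does not reproduce the actual OPAL macros. The lowercase functions always lock, while the upper-case macros consult a runtime switch and become no-ops when thread safety is turned off.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical runtime switch; per the notes, only the top (OMPI)
 * level would be allowed to turn locking off. */
static bool sketch_using_threads = false;

typedef struct { pthread_mutex_t m; } sketch_mutex_t;

/* Lowercase accessors: always safe, always take the lock. */
static int sketch_mutex_lock(sketch_mutex_t *mx)
{
    assert(mx != NULL);
    return pthread_mutex_lock(&mx->m);
}

static int sketch_mutex_unlock(sketch_mutex_t *mx)
{
    assert(mx != NULL);
    return pthread_mutex_unlock(&mx->m);
}

static int sketch_mutex_trylock(sketch_mutex_t *mx)
{
    assert(mx != NULL);
    return pthread_mutex_trylock(&mx->m);   /* 0 on success, nonzero if busy */
}

/* Upper-case accessors: skipped entirely when the switch is off. */
#define SKETCH_THREAD_LOCK(mx) \
    (sketch_using_threads ? sketch_mutex_lock(mx) : 0)
#define SKETCH_THREAD_UNLOCK(mx) \
    (sketch_using_threads ? sketch_mutex_unlock(mx) : 0)
```

Code moved down into OPAL would use only the lowercase calls, since it cannot know what the layers above are doing; the cost of that choice is what step 3 of the action plan proposes to measure.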
> ===== OTHER issues found while talking =====
> * DPM is slow, it needs a speedup.
> -== Error reporting / printf ==-
> Replace all orte_show_help with opal_show_help. Make sure that the
> symbol is not exposed anymore outside ORTE to force update.
> * orte_show_help: deduplication happens in ORTE; ompi_show_help calls
> back into orte_show_help. Let's get rid of it completely.

Jeff Squyres