Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: usnic BTL MPI_T pvar scheme
From: George Bosilca (bosilca_at_[hidden])
Date: 2013-11-05 17:59:01


I like the idea. I do have some questions, not necessarily related to
your proposal itself, but to how we can use the information you
propose to expose.

I have a question regarding the extension of this concept to multi-BTL
runs. Granted, we will have to have a local indexing of BTLs (I'm not
concerned about that). But how do we ensure the naming is globally
consistent (in the sense that all processes in the job agree that
usnic0 is index 0), even in a heterogeneous environment? As an
example, some of our clusters have 1 NIC on some nodes and 2 on
others. Of course we can say we don't guarantee consistent naming, but
for tools trying to understand communication issues in distributed
environments, having a global view is a clear plus.
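
[For concreteness, a hypothetical sketch of the kind of cross-process
check a tool might want to make. It assumes "nmodules" holds the number
of local usnic BTL modules, e.g. the count returned by
MPI_T_pvar_handle_alloc() for one of the array-valued pvars described
in the quoted proposal below; on a heterogeneous cluster the counts
differ, which is exactly when a purely local index stops being globally
meaningful.]

/* Hypothetical sketch: gather every rank's usnic module count and flag
 * mismatches.  "nmodules" is assumed to be the count returned by
 * MPI_T_pvar_handle_alloc() for any usnic array-valued pvar (see the
 * complete sketch after the quoted proposal below). */
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int *counts = malloc(size * sizeof(*counts));
MPI_Allgather(&nmodules, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

if (0 == rank) {
    for (int r = 1; r < size; ++r) {
        if (counts[r] != counts[0]) {
            printf("rank %d has %d usnic modules, rank 0 has %d\n",
                   r, counts[r], counts[0]);
        }
    }
}
free(counts);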

Another question is about the level of detail. I wonder if this level
of detail is really needed, or if providing an aggregate pvar would be
enough in most cases. The problem I see here is the lack of
topological knowledge at the upper level. Seeing a large number of
messages on a particular BTL might suggest that something is wrong
inside the implementation, when in fact that BTL is the only one
connecting a subset of peers. Without us exposing this information,
I'm afraid the tool might get the wrong picture...

Thanks,
  George.

On Tue, Nov 5, 2013 at 11:37 PM, Jeff Squyres (jsquyres)
<jsquyres_at_[hidden]> wrote:
> WHAT: suggestion for how to expose multiple MPI_T pvar values for a given variable.
>
> WHY: so that we have a common convention across OMPI (and possibly set a precedent for other MPI implementations...?).
>
> WHERE: ompi/mca/btl/usnic, but if everyone likes it, potentially elsewhere in OMPI
>
> TIMEOUT: before 1.7.4, so let's set a first timeout of next Tuesday's teleconf (Nov 12)
>
> More detail:
> ------------
>
> Per my discussion on the call today, I'm sending the attached PPT of how we're exposing MPI_T performance variables in the usnic BTL in the multi-BTL case.
>
> Feedback is welcome, especially because we're the first MPI implementation to expose MPI_T pvars in this way (already committed on the trunk and targeted for 1.7.4). So this methodology may well become a useful precedent.
>
> ** Issue #1: we want to expose each usnic BTL pvar (e.g., btl_usnic_num_sends) on a per-usnic-BTL-*module* basis. How to do this?
>
> 1. Add a prefix/suffix on each pvar name (e.g., btl_usnic_num_sends_0, btl_usnic_num_sends_1, ...etc.).
> 2. Return an array of values under the single name (btl_usnic_num_sends) -- one value for each BTL module.
>
> We opted for the 2nd option. The MPI_T pvar interface provides a way to get the array length for a pvar, so this is all fine and good.
>
> Specifically: btl_usnic_num_sends returns an array of N values, where N is the number of usnic BTL modules being used by the MPI process. Each slot in the array corresponds to the value from one usnic BTL module.
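
[For concreteness, a minimal sketch -- not the actual usnic code -- of
how an MPI process could read such an array-valued pvar. The pvar name
comes from the example above; the uint64_t datatype and the omission of
error checking are simplifying assumptions.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided, num_pvars, idx = -1;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);   /* usnic BTL modules exist after this */

    /* Find the pvar's index by name */
    MPI_T_pvar_get_num(&num_pvars);
    for (int i = 0; i < num_pvars; ++i) {
        char name[256], desc[256];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verb, vclass, bind, readonly, continuous, atomic;
        MPI_Datatype dt;
        MPI_T_enum en;

        MPI_T_pvar_get_info(i, name, &name_len, &verb, &vclass, &dt, &en,
                            desc, &desc_len, &bind, &readonly,
                            &continuous, &atomic);
        if (0 == strcmp(name, "btl_usnic_num_sends")) {
            idx = i;
            break;
        }
    }

    if (idx >= 0) {
        MPI_T_pvar_session session;
        MPI_T_pvar_handle handle;
        int count;   /* N = number of usnic BTL modules in this process */

        MPI_T_pvar_session_create(&session);
        MPI_T_pvar_handle_alloc(session, idx, NULL, &handle, &count);

        uint64_t *vals = malloc(count * sizeof(*vals));
        MPI_T_pvar_read(session, handle, vals);   /* vals[i] = module i */
        for (int i = 0; i < count; ++i) {
            printf("usnic module %d: %llu sends\n",
                   i, (unsigned long long) vals[i]);
        }

        free(vals);
        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}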
>
> ** Issue #2: but how do you map a given value to an underlying Linux usnic interface?
>
> Our solution was twofold:
>
> 1. Guarantee that the ordering of values in all pvar arrays is the same (i.e., usnic BTL module 0 will always be in slot 0, usnic BTL module 1 will always be in slot 1, ...etc.).
>
> 2. Add another pvar that is an MPI_T state variable with an associated MPI_T "enumeration", which contains string names of the underlying Linux devices. This allows you to map a given value from a pvar to an underlying Linux device (e.g., from usnic BTL module 2 to /dev/usnic_3, or whatever).
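
[A sketch of the mapping step, continuing inside the same session as the
sketch under Issue #1. The exact name of the device-mapping state pvar
is not given above, so "devs_idx" and its enumeration "devs_enum" -- as
reported by MPI_T_pvar_get_info() for that pvar -- are placeholders.]

/* Sketch: resolve each array slot to a Linux device name via the MPI_T
 * enumeration attached to the state pvar.  State pvars are read as ints
 * per the MPI standard. */
char enum_name[64];
int num_items, enum_name_len = sizeof(enum_name);
MPI_T_enum_get_info(devs_enum, &num_items, enum_name, &enum_name_len);

MPI_T_pvar_handle dev_handle;
int dev_count;
MPI_T_pvar_handle_alloc(session, devs_idx, NULL, &dev_handle, &dev_count);

int *dev_vals = malloc(dev_count * sizeof(*dev_vals));
MPI_T_pvar_read(session, dev_handle, dev_vals);

for (int slot = 0; slot < dev_count; ++slot) {
    for (int item = 0; item < num_items; ++item) {
        char dev_name[128];
        int value, len = sizeof(dev_name);
        MPI_T_enum_get_item(devs_enum, item, &value, dev_name, &len);
        if (value == dev_vals[slot]) {
            /* Slot "slot" in every usnic pvar array maps to this device */
            printf("usnic module %d -> %s\n", slot, dev_name);
        }
    }
}

free(dev_vals);
MPI_T_pvar_handle_free(session, &dev_handle);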
>
> See the attached PPT.
>
> If people have no objection to this, we should use this convention across OMPI (e.g., for other BTLs that expose MPI_T pvars).
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>