Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: usnic BTL MPI_T pvar scheme
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-11-05 21:06:15

On Nov 5, 2013, at 2:59 PM, George Bosilca <bosilca_at_[hidden]> wrote:

> I have a question regarding the extension of this concept to multi-BTL
> runs. Granted we will have to have a local indexing of BTL (I'm not
> concerned about this). But how do we ensure the naming is globally
> consistent (in the sense that all processes in the job will agree that
> usnic0 is index 0) even when we have a heterogeneous environment?

The MPI_T pvars are local-only. So even if index 0 is usnic_0 in proc A and usnic_3 in proc B, it shouldn't matter. More specifically: these values only have meaning within the process from which they were gathered.

I guess I'm trying to say that there's no need to ensure globally consistent ordering between processes. ...unless I'm missing something?
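
To make the local-only point concrete, here is a minimal sketch in plain C (no MPI calls). The struct, the tables, the counter values, and the helper `find_local_index` are all hypothetical and exist only for illustration; they are not part of the usnic BTL or the MPI_T API. The sketch assumes a tool that wants to compare processes looks entries up by device name within each process rather than assuming index i means the same device everywhere.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-process view: the position in the array is the
 * process-local index, which has no meaning outside that process. */
struct local_counter {
    const char *name;          /* device name, e.g. "usnic_0" */
    unsigned long long value;  /* made-up counter value read in this process */
};

/* Return the local index of `name` in one process's table, or -1 if
 * that process has no such device. */
static int find_local_index(const struct local_counter *tbl, size_t n,
                            const char *name)
{
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(tbl[i].name, name) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Two processes on heterogeneous nodes: index 0 is usnic_0 in proc A
 * but usnic_3 in proc B, matching the case described above. */
static const struct local_counter proc_a[] = {
    { "usnic_0", 120 },
};
static const struct local_counter proc_b[] = {
    { "usnic_3", 7 },
    { "usnic_0", 45 },
};
```

Under these assumptions, `find_local_index(proc_b, 2, "usnic_0")` yields 1 while the same lookup in `proc_a` yields 0, and neither process needs to know about the other's ordering.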

> As
> an example some of our clusters have 1 NIC on some nodes, and 2 on
> others. Of course we can say we don't guarantee consistent naming, but
> for tools trying to understand communication issues on distributed
> environments having a global view is a clear plus.

A good point. But even with globally consistent ordering, you don't know that usnic_0 in process A communicates with usnic_0 in process B (indeed, we run some QA cases here at Cisco where we deliberately ensure that usnic_X in process A is on the same subnet as usnic_Y in process B, where X!=Y, and everything still works properly).
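
As a rough illustration of that QA setup, the sketch below pairs interfaces by IPv4 subnet instead of by name. It is only a model under stated assumptions: the `struct iface` layout, the addresses, the /24 prefixes, and the helpers `same_subnet` and `find_peer` are invented for this example and do not describe how the usnic BTL actually selects peers.

```c
#include <stddef.h>
#include <stdint.h>

/* Build an IPv4 address in host byte order from four octets. */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical description of one interface as seen by one process. */
struct iface {
    const char *name;  /* process-local device name */
    uint32_t addr;     /* IPv4 address, host byte order */
    unsigned prefix;   /* prefix length, assumed to be in 0..32 */
};

/* Netmask for a prefix length; prefix 0 is handled separately to
 * avoid shifting a 32-bit value by 32 bits. */
static uint32_t netmask(unsigned prefix)
{
    return prefix == 0 ? 0u : (uint32_t)(0xFFFFFFFFu << (32u - prefix));
}

/* Nonzero if both interfaces have the same prefix length and network. */
static int same_subnet(const struct iface *a, const struct iface *b)
{
    if (a->prefix != b->prefix) {
        return 0;
    }
    uint32_t m = netmask(a->prefix);
    return (a->addr & m) == (b->addr & m);
}

/* Index of the first remote interface on the same subnet as `local`,
 * or -1 if there is none. */
static int find_peer(const struct iface *local,
                     const struct iface *remote, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (same_subnet(local, &remote[i])) {
            return (int)i;
        }
    }
    return -1;
}

/* Deliberately crossed wiring: usnic_0 in A shares a subnet with
 * usnic_1 in B, and usnic_1 in A with usnic_0 in B. */
static const struct iface proc_a_ifs[] = {
    { "usnic_0", IP4(10, 0, 1, 5), 24 },
    { "usnic_1", IP4(10, 0, 2, 5), 24 },
};
static const struct iface proc_b_ifs[] = {
    { "usnic_0", IP4(10, 0, 2, 9), 24 },
    { "usnic_1", IP4(10, 0, 1, 9), 24 },
};
```

With this made-up layout, the peer of A's usnic_0 is B's usnic_1, so matching names across processes would tell a tool nothing about which interfaces actually talk to each other.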

> Another question is about the level of details. I wonder if this level
> of details is really needed, or providing the aggregate pvar will be
> enough in most cases. The problem I see here is the lack of
> topological knowledge at the upper level. Seeing a large number of
> messages on a particular BTL might suggest that something is wrong
> inside the implementation, when in fact the BTL is the only one
> connecting a subset of peers. Without us exposing this information,
> I'm afraid the tool might get the wrong picture ...

I think network-level information can only tell you about upper-layer MPI semantics indirectly. However, these counters were not intended to convey MPI-application-level semantic information; they were intended to expose what is happening on your underlying network -- something that OS-bypass networks don't otherwise provide.

Jeff Squyres