On Jul 30, 2014, at 5:49 PM, George Bosilca <bosilca@icl.utk.edu> wrote:

On Jul 30, 2014, at 20:37 , Ralph Castain <rhc@open-mpi.org> wrote:

On Jul 30, 2014, at 5:25 PM, George Bosilca <bosilca@icl.utk.edu> wrote:

On Jul 30, 2014, at 18:00 , Jeff Squyres (jsquyres) <jsquyres@cisco.com> wrote:

WHAT: Should we make the job size (i.e., initial number of procs) available in OPAL?

WHY: At least 2 BTLs are using this info (*more below)

WHERE: usnic and ugni

TIMEOUT: there have already been some inflammatory emails about this; let's discuss next Tuesday on the teleconf: Tue, 5 Aug 2014


This is an open question.  We *have* the information at the time that the BTLs are initialized: do we allow that information to go down to OPAL?

Ralph added this info down in OPAL in r32355, but George reverted it in r32361.

Points for: YES, WE SHOULD
+++ 2 BTLs were using it (usnic, ugni)
+++ Other RTE job-related info is already in OPAL (num local ranks, local rank)

Points for: NO, WE SHOULD NOT
--- What exactly is this number (e.g., num currently-connected procs?), and when is it updated?
--- We need to precisely delineate what belongs in OPAL vs. above-OPAL
--- Using this information to configure the communication environment limits the scope of the communication substrate to a static application (static in its number of participants). Under this assumption, one can simply wait until the first add_proc to compute the number of processes, a solution as “correct” as the current one.
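
A minimal sketch of the deferred-sizing idea in the last point, assuming a transport that leaves its receive queue unsized until the upper layer first announces peers. Every name here (sketch_transport_t, sketch_add_procs, the 2-per-peer rule and its bounds) is hypothetical and is not taken from usnic, ugni, or the real BTL interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transport state; NOT the real BTL module struct. */
typedef struct {
    size_t nprocs_seen;   /* peers announced so far */
    size_t rx_queue_len;  /* 0 until the first add_procs-style call */
} sketch_transport_t;

/* Hypothetical sizing rule: 2 entries per peer, clamped to [64, 4096]. */
static size_t sketch_queue_len_for(size_t nprocs)
{
    size_t len = 2 * nprocs;
    if (len < 64) {
        len = 64;
    }
    if (len > 4096) {
        len = 4096;
    }
    return len;
}

/* Called each time the upper layer announces a batch of peers.
 * The queue is sized once, from the first batch, instead of from
 * a job size handed down at init time. */
static void sketch_add_procs(sketch_transport_t *t, size_t nprocs)
{
    assert(NULL != t);
    t->nprocs_seen += nprocs;
    if (0 == t->rx_queue_len) {
        t->rx_queue_len = sketch_queue_len_for(t->nprocs_seen);
    }
}
```

In this sketch a later batch of peers updates nprocs_seen but does not resize the queue, which is the same "static" behavior the point above attributes to sizing from the job size.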

Not necessarily - it depends on how it is used, and how it is communicated. Some of us have explored other options for using this that aren’t static, but where the info is of use.

This is a little bit too much hand waving to be constructive. Some other folks in the field have developed many communication libraries, and none of them needed a random number of potential processes to initialize themselves correctly.

That's fine - everyone innovates and does something new. I'm not about to divulge proprietary, competitive info to you in advance just to justify our needs. I'll only note that notification of change isn't the sole jurisdiction of the FT group, and some of us have other uses for it.

The other “global” information that was made available in OPAL (num_local_peers and my_local_rank) is only used by the local BTLs (SM, SMCUDA and VADER). Moreover, my_local_rank is only used to decide who initializes the backend file, something that can easily be done using an atomic operation. The number of local processes is used to prevent SM from activating itself if we don’t have at least 2 processes per node. So, their usage is minimally invasive, and can eventually be phased out with a little effort.
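
One way to read "using an atomic operation" above is atomic file creation: POSIX specifies that open() with O_CREAT | O_EXCL checks for the file and creates it as a single atomic step, so exactly one local process wins and no local rank is needed to pick the initializer. This is only a sketch under that assumption; the function name is hypothetical, and an atomic compare-and-swap in shared memory may be closer to what was meant.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Open MPI.
 * Returns  1 if this process created the file and should initialize it,
 *          0 if another process created it first,
 *         -1 on error.
 * On success, *fd_out holds an open read/write descriptor. */
static int sketch_open_backing_file(const char *path, int *fd_out)
{
    assert(NULL != path && NULL != fd_out);

    /* O_CREAT | O_EXCL fails with EEXIST if the file already exists. */
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        *fd_out = fd;
        return 1;
    }
    if (EEXIST != errno) {
        return -1;
    }

    /* Lost the race: attach to the file the winner created. */
    fd = open(path, O_RDWR);
    if (fd < 0) {
        return -1;
    }
    *fd_out = fd;
    return 0;
}
```

The sketch only elects the initializer. Processes that get 0 would still have to wait until the winner has finished writing the file's contents (for example by polling a ready flag), which is omitted here.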

FWIW: the new PMI abstraction is in OPAL because it is RTE-agnostic. So all the info being discussed will actually be captured originally in the OPAL layer, and stored in the OPAL dstore framework. In the current code, the RTE grabs the data and exposes it to the OMPI layer, which then pushes it back down to the OPAL proc.h struct.

<shrug> since anyone can freely query the info from opal/pmix or opal/dstore, it is really irrelevant in some ways. The info is there, in the OPAL layer, prior to BTLs being initialized. If you don't want it in a global storage, people can just get it from the appropriate OPAL API.

So what are we actually debating here? Global storage vs API call?

Our goals in this project are clearly orthogonal. I put a lot of effort into this move because I need to use the BTLs without PMI, without RTE.

And you are certainly free to do so. Nobody is putting a gun to your head and demanding that your BTLs use it.

In fact the question boils down to: Do you want to be able to use the BTL to bootstrap the RTE or not? If yes, then the number of processes is out of the picture, either as an API or as a global storage.

Yes, I do - and no, it isn't a black/white question. I can use the BTLs to bootstrap just fine, even when someone uses that info for an initial optimization. I can always notify them later when things change, and they can make adjustments if necessary.

Again, nobody is forcing you to use any of the data in the opal dstore. It is just there if someone *wants* to use it. I fail to understand why you want to tell everyone else what they can do in their BTL. If you don't like how they wrote it, you are always free to write your own version of it. Nobody will stop you.

So what is the issue here?



FWIW: here's how ompi_process_info.num_procs was used before the BTL move down to OPAL:

- usnic: for a minor latency optimization / sizing of a shared receive buffer queue length, and for the initial size of a peer lookup hash
- ugni: to determine the size of the per-peer buffers used for send/recv communication
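
A rough sketch of how a job size can serve as a sizing hint for a peer lookup hash without being required for correctness: the table starts from the hint when one is available, falls back to a small default when it is not, and grows as peers are actually added. All names and constants are hypothetical and do not reflect the usnic implementation; only the sizing policy is modeled, with bucket storage and rehashing omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a peer lookup table. */
typedef struct {
    size_t capacity;  /* bucket count, always a power of two */
    size_t count;     /* peers currently stored */
} sketch_peer_table_t;

/* Smallest power of two >= n, with a floor of 16 buckets. */
static size_t sketch_next_pow2(size_t n)
{
    size_t p = 16;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* nprocs_hint may be 0, meaning "job size unknown". */
static void sketch_table_init(sketch_peer_table_t *t, size_t nprocs_hint)
{
    assert(NULL != t);
    /* Aim for a load factor of at most 1/2 if the hint is accurate. */
    t->capacity = sketch_next_pow2(2 * nprocs_hint);
    t->count = 0;
}

/* Account for one inserted peer; double the bucket count once the
 * load factor exceeds 1/2 (the rehash itself is not shown). */
static void sketch_table_note_insert(sketch_peer_table_t *t)
{
    assert(NULL != t);
    t->count++;
    if (2 * t->count > t->capacity) {
        t->capacity <<= 1;
    }
}
```

With an accurate hint the table never resizes during startup; with no hint, or with peers arriving later, it pays for a few doublings instead. That is the trade-off between using the number as an optimization and depending on it.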

Jeff Squyres
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

devel mailing list
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15373.php

Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15378.php

Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15379.php

Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15381.php