Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: changes to modex
From: Tim Prins (tprins_at_[hidden])
Date: 2008-04-02 11:10:08

Is there a reason to rename ompi_modex_{send,recv} to
ompi_modex_proc_{send,recv}? It seems simpler, no more confusing, and
less work to leave the names alone and add ompi_modex_node_{send,recv}.

Another question: Does the receiving process care that the information
received applies to a whole node? I ask because maybe we could get the
same effect by simply adding a parameter to ompi_modex_send that
specifies whether the data applies to just the proc or to a whole node.

So, if we have ranks 1 and 2 on n1, and rank 3 on n2, then rank 1 would do:

    ompi_modex_send("arch", arch, <applies to whole node>);

then rank 3 would do:

    ompi_modex_recv(rank 1, "arch");
    ompi_modex_recv(rank 2, "arch");

and both calls would return the value rank 1 sent, even though rank 2
never sent anything itself.
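To make the single-call idea concrete, here is a minimal sketch in C. It is a toy in-memory model, not the Open MPI API: every toy_* name, the scope enum, the fixed rank-to-node table, and the "x86_64" value are assumptions made up for illustration, and it handles only one key ("arch"). The point it shows is that the receiver never has to know whether a value was sent per-proc or per-node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, not the Open MPI API. Layout from the example:
 * ranks 1 and 2 on node 0 ("n1"), rank 3 on node 1 ("n2"). */
enum toy_scope { TOY_SCOPE_PROC, TOY_SCOPE_NODE };

#define TOY_NUM_RANKS 4   /* index 0 unused so indices match rank numbers */
#define TOY_NUM_NODES 2

static const int toy_node_of[TOY_NUM_RANKS] = { -1, 0, 0, 1 };
static const char *toy_proc_val[TOY_NUM_RANKS];  /* per-proc "arch" values */
static const char *toy_node_val[TOY_NUM_NODES];  /* node-wide "arch" values */

static int toy_rank_ok(int rank)
{
    return rank >= 1 && rank < TOY_NUM_RANKS;
}

/* Publish a single "arch" value; scope says whether it covers the node. */
int toy_modex_send(int my_rank, const char *val, enum toy_scope scope)
{
    if (!toy_rank_ok(my_rank)) {
        return -1;
    }
    if (scope == TOY_SCOPE_NODE) {
        toy_node_val[toy_node_of[my_rank]] = val;
    } else {
        toy_proc_val[my_rank] = val;
    }
    return 0;
}

/* Look up a peer: proc-specific value first, else its node's value. */
const char *toy_modex_recv(int peer_rank)
{
    if (!toy_rank_ok(peer_rank)) {
        return NULL;
    }
    if (toy_proc_val[peer_rank] != NULL) {
        return toy_proc_val[peer_rank];
    }
    return toy_node_val[toy_node_of[peer_rank]];
}

/* The scenario from the example: only rank 1 sends. */
void toy_example(void)
{
    assert(toy_modex_send(1, "x86_64", TOY_SCOPE_NODE) == 0);
    assert(strcmp(toy_modex_recv(1), "x86_64") == 0);
    assert(strcmp(toy_modex_recv(2), "x86_64") == 0);
    assert(toy_modex_recv(3) == NULL);  /* nobody on n2 has sent yet */
}
```

Letting a proc-specific value shadow the node-wide one is just one possible lookup order; a real implementation might reject the mix instead.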

I don't really care either way, just wanted to throw out the idea.


Jeff Squyres wrote:
> WHAT: Changes to MPI layer modex API
> WHY: To be mo' betta scalable
> WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
> calls ompi_modex_send() and/or ompi_modex_recv()
> TIMEOUT: COB Fri 4 Apr 2008
>
> Per some of the scalability discussions that have been occurring (some
> on-list and some off-list), and per the e-mail I sent out last week
> about ongoing work in the openib BTL, Ralph and I put together a loose
> proposal this morning to make the modex more scalable. The timeout is
> fairly short because Ralph wanted to start implementing in the near
> future, and we didn't anticipate that this would be a contentious
> proposal.
>
> The theme is to break the modex into two different kinds of data:
> - Modex data that is specific to a given proc
> - Modex data that is applicable to all procs on a given node
>
> For example, in the openib BTL, the majority of modex data is
> applicable to all processes on the same node (GIDs and LIDs and
> whatnot). It is much more efficient to send only one copy of such
> node-specific data to each process (vs. sending ppn copies to each
> process). The spreadsheet I included in last week's e-mail clearly
> shows this.
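The copy-count argument above can be put as rough arithmetic. This is only a sketch under stated assumptions: a job of `nodes` nodes with `ppn` processes on each, every process receiving the modex data of every process in the job, and hypothetical mx_* function names. It does not reproduce the spreadsheet mentioned in the RFC.

```c
/* Number of node-wide blobs ONE process receives during the modex.
 * Assumes `nodes` nodes with `ppn` processes each, and that every process
 * receives the data of every process in the job (a simplification). */

/* Current scheme: each peer proc repeats its node's data. */
unsigned long mx_copies_per_proc_today(unsigned long nodes, unsigned long ppn)
{
    return nodes * ppn;
}

/* Proposed scheme: one blob per node, regardless of ppn. */
unsigned long mx_copies_per_proc_proposed(unsigned long nodes)
{
    return nodes;
}
```

With illustrative numbers (1024 nodes, 8 ppn), that is 8192 blobs per process today against 1024 under the proposal, a reduction by a factor of ppn.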
>
> 1. Add new modex API functions. The exact function signatures are
> TBD, but they will be generally of the form:
> * int ompi_modex_proc_send(...): send modex data that is specific to
> this process. It is just about exactly the same as the current API
> call (ompi_modex_send).
> * int ompi_modex_proc_recv(...): receive modex data from a specified
> peer process (indexed on ompi_proc_t*). It is just about exactly the
> same as the current API call (ompi_modex_recv).
> * int ompi_modex_node_send(...): send modex data that is relevant
> for all processes in this job on this node. It is intended that only
> one process in a job on a node will call this function. If more than
> one process in a job on a node calls _node_send(), then only one will
> "win" (meaning that the data sent by the others will be overwritten).
> * int ompi_modex_node_recv(...): receive modex data that is relevant
> for a whole peer node; receive the ["winning"] blob sent by
> _node_send() from the source node. We haven't yet decided what the
> node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
> figure out what node the (ompi_proc_t*) resides on and then give you
> the data).
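Since the signatures are still TBD, the following is only a guess at the shape of the four calls, written as a toy in-memory store in C. Everything here is hypothetical: the sk_* names, the sk_proc_t stand-in for ompi_proc_t, the fixed-size blobs, and the choice to index _node_recv() by proc (one of the options the RFC mentions). It shows the "only one will win" behaviour of _node_send(): a later sender on the same node overwrites the earlier blob.

```c
#include <stddef.h>
#include <string.h>

#define SK_MAX_PROCS 8
#define SK_MAX_NODES 4
#define SK_BLOB_MAX  64

/* Hypothetical stand-in for ompi_proc_t: a rank and the node it runs on. */
typedef struct { int rank; int node; } sk_proc_t;

typedef struct {
    unsigned char data[SK_BLOB_MAX];
    size_t len;
    int valid;
} sk_blob_t;

static sk_blob_t sk_proc_store[SK_MAX_PROCS];  /* one slot per process */
static sk_blob_t sk_node_store[SK_MAX_NODES];  /* one slot per node */

static int sk_proc_ok(const sk_proc_t *p)
{
    return p != NULL && p->rank >= 0 && p->rank < SK_MAX_PROCS &&
           p->node >= 0 && p->node < SK_MAX_NODES;
}

/* Copy a blob into a slot, overwriting whatever was there. */
static int sk_store(sk_blob_t *slot, const void *buf, size_t len)
{
    if (buf == NULL || len > SK_BLOB_MAX) {
        return -1;
    }
    memcpy(slot->data, buf, len);
    slot->len = len;
    slot->valid = 1;
    return 0;
}

/* Copy a blob out; *len is the buffer size on entry, blob size on exit. */
static int sk_load(const sk_blob_t *slot, void *buf, size_t *len)
{
    if (buf == NULL || len == NULL || !slot->valid || *len < slot->len) {
        return -1;
    }
    memcpy(buf, slot->data, slot->len);
    *len = slot->len;
    return 0;
}

int sk_modex_proc_send(const sk_proc_t *me, const void *buf, size_t len)
{
    return sk_proc_ok(me) ? sk_store(&sk_proc_store[me->rank], buf, len) : -1;
}

int sk_modex_proc_recv(const sk_proc_t *peer, void *buf, size_t *len)
{
    return sk_proc_ok(peer) ? sk_load(&sk_proc_store[peer->rank], buf, len) : -1;
}

/* Last caller on a node "wins": the node slot is simply overwritten. */
int sk_modex_node_send(const sk_proc_t *me, const void *buf, size_t len)
{
    return sk_proc_ok(me) ? sk_store(&sk_node_store[me->node], buf, len) : -1;
}

/* Indexed by proc: work out which node the peer is on, return its blob. */
int sk_modex_node_recv(const sk_proc_t *peer, void *buf, size_t *len)
{
    return sk_proc_ok(peer) ? sk_load(&sk_node_store[peer->node], buf, len) : -1;
}
```

If two procs on node 0 both call sk_modex_node_send(), a later sk_modex_node_recv() for either of them returns whichever blob was stored last. The proc and node stores are independent, so node data never leaks into _proc_recv().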
>
> 2. Make the existing modex API calls (ompi_modex_send,
> ompi_modex_recv) be wrappers around the new "proc" send/receive
> calls. This will provide exactly the same functionality as the
> current API (but be sub-optimal at scale). It will give BTL authors
> (etc.) time to update to the new API, potentially taking advantage of
> common data across multiple processes on the same node. We'll likely
> put in some opal_output()'s in the wrappers to help identify code that
> is still calling the old APIs.
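The wrapper step could look roughly like the sketch below. It is not the planned implementation: the wr_* names and the simplified (buf, len) signature are hypothetical, the "new" call is a stub that only records its argument, and fprintf(stderr, ...) stands in for opal_output(). The macro is one way to make the message name the call site, which is what helps find code still on the old API.

```c
#include <stddef.h>
#include <stdio.h>

static size_t wr_last_len;    /* length the "new" call last received */
static int wr_legacy_calls;   /* how many times the old name was used */

/* Hypothetical stand-in for the new proc-specific send; it only records. */
int wr_modex_proc_send(const void *buf, size_t len)
{
    (void)buf;
    wr_last_len = len;
    return 0;
}

/* Old entry point: report the call site, then forward unchanged.
 * fprintf(stderr, ...) stands in for opal_output() here. */
int wr_modex_send_from(const char *file, int line,
                       const void *buf, size_t len)
{
    wr_legacy_calls++;
    fprintf(stderr, "%s:%d: legacy modex send called; "
            "please move to the proc/node calls\n", file, line);
    return wr_modex_proc_send(buf, len);
}

/* Keep the old spelling compiling while capturing who called it. */
#define wr_modex_send(buf, len) \
    wr_modex_send_from(__FILE__, __LINE__, (buf), (len))

/* Small accessors so callers (and tests) can inspect the counters. */
int wr_legacy_call_count(void) { return wr_legacy_calls; }
size_t wr_last_send_len(void) { return wr_last_len; }
```

Existing callers of wr_modex_send() keep working and get the same return value as the new call, at the cost of one diagnostic line per call.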
>
> 3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
> v1.3 is released.