
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: changes to modex
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-04-02 12:03:17


On Apr 2, 2008, at 11:10 AM, Tim Prins wrote:
> Is there a reason to rename ompi_modex_{send,recv} to
> ompi_modex_proc_{send,recv}? It seems simpler (and no more
> confusing, and less work) to leave the names alone and add
> ompi_modex_node_{send,recv}.

If the arguments don't change, I don't have a strong objection to
leaving the names alone. I think the rationale for the new names is:

- the arguments may change
- completely clear names, and good symmetry with *_node_* and *_proc_*

If the args change, then I think it is best to use new names so that
BTL authors (etc.) have time to adapt. If not, then I mildly prefer
the new names, but don't care too much.
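
For reference, the symmetric set proposed in the RFC below (exact
arguments still TBD) would be:

    int ompi_modex_proc_send(...);
    int ompi_modex_proc_recv(...);
    int ompi_modex_node_send(...);
    int ompi_modex_node_recv(...);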

> Another question: Does the receiving process care that the information
> received applies to a whole node? I ask because maybe we could get the
> same effect by simply adding a parameter to ompi_modex_send which
> specifies if the data applies to just the proc or a whole node.
>
> So, if we have ranks 1 & 2 on n1, and rank 3 on n2, then rank 1
> would do:
> ompi_modex_send("arch", arch, <applies to whole node>);
> then rank 3 would do:
> ompi_modex_recv(rank 1, "arch");
> ompi_modex_recv(rank 2, "arch");

I'm not sure I understand what you mean. Proc 3 would get the one
blob that was sent from proc 1?

In the openib btl, I'll likely have both node and proc portions to send.
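
For example (hand-waving here, since the signatures aren't settled),
I'd expect the openib modex calls to end up looking something like:

    /* one node-common blob: LIDs/GIDs/etc. shared by every proc on
       this host -- only one proc per node needs to send this */
    ompi_modex_node_send("openib-node", node_blob, node_blob_len);

    /* one per-proc blob: data unique to this process */
    ompi_modex_proc_send("openib-proc", proc_blob, proc_blob_len);

(The string keys and blob arguments above are made up for
illustration.)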

>
> I don't really care either way, just wanted to throw out the idea.
>
> Tim
>
> Jeff Squyres wrote:
>> WHAT: Changes to MPI layer modex API
>>
>> WHY: To be mo' betta scalable
>>
>> WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
>> calls ompi_modex_send() and/or ompi_modex_recv()
>>
>> TIMEOUT: COB Fri 4 Apr 2008
>>
>> DESCRIPTION:
>>
>> Per some of the scalability discussions that have been occurring
>> (some
>> on-list and some off-list), and per the e-mail I sent out last week
>> about ongoing work in the openib BTL, Ralph and I put together a
>> loose
>> proposal this morning to make the modex more scalable. The timeout
>> is
>> fairly short because Ralph wanted to start implementing in the near
>> future, and we didn't anticipate that this would be a contentious
>> proposal.
>>
>> The theme is to break the modex into two different kinds of data:
>>
>> - Modex data that is specific to a given proc
>> - Modex data that is applicable to all procs on a given node
>>
>> For example, in the openib BTL, the majority of modex data is
>> applicable to all processes on the same node (GIDs and LIDs and
>> whatnot). It is much more efficient to send only one copy of such
>> node-specific data to each process (vs. sending ppn copies to each
>> process). The spreadsheet I included in last week's e-mail clearly
>> shows this.
>>
>> 1. Add new modex API functions. The exact function signatures are
>> TBD, but they will be generally of the form:
>>
>> * int ompi_modex_proc_send(...): send modex data that is specific to
>> this process. It is just about exactly the same as the current API
>> call (ompi_modex_send).
>>
>> * int ompi_modex_proc_recv(...): receive modex data from a specified
>> peer process (indexed on ompi_proc_t*). It is just about exactly the
>> same as the current API call (ompi_modex_recv).
>>
>> * int ompi_modex_node_send(...): send modex data that is relevant
>> for all processes in this job on this node. It is intended that only
>> one process in a job on a node will call this function. If more than
>> one process in a job on a node calls _node_send(), then only one will
>> "win" (meaning that the data sent by the others will be overwritten).
>>
>> * int ompi_modex_node_recv(...): receive modex data that is relevant
>> for a whole peer node; receive the ["winning"] blob sent by
>> _node_send() from the source node. We haven't yet decided what the
>> node index will be; it may be (ompi_proc_t*) (i.e., _node_recv()
>> would
>> figure out what node the (ompi_proc_t*) resides on and then give you
>> the data).
>>
>> 2. Make the existing modex API calls (ompi_modex_send,
>> ompi_modex_recv) be wrappers around the new "proc" send/receive
>> calls. This will provide exactly the same functionality as the
>> current API (but be sub-optimal at scale). It will give BTL authors
>> (etc.) time to update to the new API, potentially taking advantage of
>> common data across multiple processes on the same node. We'll likely
>> put in some opal_output()'s in the wrappers to help identify code
>> that
>> is still calling the old APIs.
>>
>> 3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
>> v1.3 is released.
>>
>
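P.S. For step 2, I'm imagining the compatibility wrappers looking
roughly like this (a sketch only -- I'm reusing the current
ompi_modex_send signature, and the ompi_modex_proc_send arguments are
hypothetical until the new API is finalized):

    int ompi_modex_send(mca_base_component_t *source_component,
                        const void *data, size_t size)
    {
        /* nag so we can find code still calling the old API */
        opal_output(0, "ompi_modex_send is deprecated; please use "
                       "ompi_modex_proc_send");
        return ompi_modex_proc_send(source_component, data, size);
    }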

-- 
Jeff Squyres
Cisco Systems