Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: changes to modex
From: Tim Prins (tprins_at_[hidden])
Date: 2008-04-15 09:17:15

Hate to bring this up again, but I was thinking that an easy way to
reduce the size of the modex would be to reduce the length of the names
describing each piece of data.

More concretely, for a simple run I get the following names, each of
which is sent over the wire for every proc (note that this will change
depending on the number of btls one has active):

So that's 89 bytes of naming overhead (size of strings + dss packing
overhead) per process.

A couple of possible solutions to this:
1. Send 32-bit string hashes instead of the strings. This would reduce
the per-process size from 89 to 20 bytes, but there is always a (slight)
possibility of collisions.

2. Change the way the dss packs strings. Currently, it packs a 32-bit
string length, the string, and a null terminator. It may be good enough
to just pack the string and the NULL terminator. This would reduce
per-process size from 89 to 69 bytes.

3. Reduce the length of the names. 'ompi-proc-info' could become simply
'pinf', and two of the separators could be removed in the other names
(ex: 'btl.openib.1.3' -> 'btlopenib1.3'). This would change the per
process size from 89 to 71 bytes.

4. Do #2 & #3. This would change the per process size from 89 to 51 bytes.
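
To make options 1 and 2 above concrete, here is a rough sketch in C. It is
not taken from the Open MPI tree: the function names are made up for
illustration, the hash is plain 32-bit FNV-1a (picked only because it is
well known, not because the dss or modex uses it), and the "current" string
layout simply follows the description in option 2 (32-bit length, string
bytes, NULL terminator). Any other dss overhead is ignored.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Option 1 (sketch): replace each name with a fixed 4-byte hash.
 * FNV-1a is an assumption made for illustration only. Two different
 * names can still collide, as noted in the text. */
static uint32_t name_hash32(const char *name)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    for (; *name != '\0'; ++name) {
        h ^= (uint8_t)*name;         /* mix in one byte of the name */
        h *= 16777619u;              /* FNV prime, wraps mod 2^32 */
    }
    return h;
}

/* Current layout as described: 32-bit length + string + terminator. */
static size_t packed_size_with_length(const char *s)
{
    return sizeof(uint32_t) + strlen(s) + 1;
}

/* Option 2 (sketch): string + terminator only, 4 bytes less per name. */
static size_t packed_size_terminated(const char *s)
{
    return strlen(s) + 1;
}

/* Copy the string and its terminator into buf; return bytes written.
 * The caller must make sure buf has room for strlen(s) + 1 bytes. */
static size_t pack_terminated(uint8_t *buf, const char *s)
{
    size_t n = strlen(s) + 1;
    memcpy(buf, s, n);
    return n;
}

/* Return a pointer to the string starting at buf[*off] and advance
 * *off past its terminator. Returns NULL if no terminator is found
 * within the remaining len - *off bytes. */
static const char *unpack_terminated(const uint8_t *buf, size_t len,
                                     size_t *off)
{
    const uint8_t *start = buf + *off;
    const uint8_t *end = memchr(start, '\0', len - *off);
    if (end == NULL) {
        return NULL;
    }
    *off += (size_t)(end - start) + 1;
    return (const char *)start;
}
```

The cost of dropping the length field is that the unpack side has to scan
for the terminator instead of knowing the size up front. The saving is one
32-bit field per name, so the 20 bytes quoted for option 2 would correspond
to five names.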

Anyways, just an idea for consideration.


> WHAT: Changes to MPI layer modex API
> WHY: To be mo' betta scalable
> WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
> calls ompi_modex_send() and/or ompi_modex_recv()
> TIMEOUT: COB Fri 4 Apr 2008
> Per some of the scalability discussions that have been occurring (some
> on-list and some off-list), and per the e-mail I sent out last week
> about ongoing work in the openib BTL, Ralph and I put together a loose
> proposal this morning to make the modex more scalable. The timeout is
> fairly short because Ralph wanted to start implementing in the near
> future, and we didn't anticipate that this would be a contentious
> proposal.
> The theme is to break the modex into two different kinds of data:
> - Modex data that is specific to a given proc
> - Modex data that is applicable to all procs on a given node
> For example, in the openib BTL, the majority of modex data is
> applicable to all processes on the same node (GIDs and LIDs and
> whatnot). It is much more efficient to send only one copy of such
> node-specific data to each process (vs. sending ppn copies to each
> process). The spreadsheet I included in last week's e-mail clearly
> shows this.
> 1. Add new modex API functions. The exact function signatures are
> TBD, but they will be generally of the form:
> * int ompi_modex_proc_send(...): send modex data that is specific to
> this process. It is just about exactly the same as the current API
> call (ompi_modex_send).
> * int ompi_modex_proc_recv(...): receive modex data from a specified
> peer process (indexed on ompi_proc_t*). It is just about exactly the
> same as the current API call (ompi_modex_recv).
> * int ompi_modex_node_send(...): send modex data that is relevant
> for all processes in this job on this node. It is intended that only
> one process in a job on a node will call this function. If more than
> one process in a job on a node calls _node_send(), then only one will
> "win" (meaning that the data sent by the others will be overwritten).
> * int ompi_modex_node_recv(...): receive modex data that is relevant
> for a whole peer node; receive the ["winning"] blob sent by
> _node_send() from the source node. We haven't yet decided what the
> node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
> figure out what node the (ompi_proc_t*) resides on and then give you
> the data).
> 2. Make the existing modex API calls (ompi_modex_send,
> ompi_modex_recv) be wrappers around the new "proc" send/receive
> calls. This will provide exactly the same functionality as the
> current API (but be sub-optimal at scale). It will give BTL authors
> (etc.) time to update to the new API, potentially taking advantage of
> common data across multiple processes on the same node. We'll likely
> put in some opal_output()'s in the wrappers to help identify code that
> is still calling the old APIs.
> 3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
> v1.3 is released.
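
For a feel of why the proc/node split in the quoted RFC helps, here is a
back-of-the-envelope sketch in C. It only models the amount of modex data
one process receives; the function names and the uniform ppn are
assumptions for illustration, and nothing here reflects the actual
(still TBD) ompi_modex_proc_* / ompi_modex_node_* signatures.

```c
#include <stddef.h>

/* Current scheme (sketch): every peer process sends its own copy of
 * both its per-proc data and the data shared by its whole node. */
static size_t modex_bytes_unsplit(size_t nnodes, size_t ppn,
                                  size_t proc_bytes, size_t node_bytes)
{
    return nnodes * ppn * (proc_bytes + node_bytes);
}

/* Proposed scheme (sketch): per-proc data still arrives once per peer
 * process, but node-wide data arrives only once per node. */
static size_t modex_bytes_split(size_t nnodes, size_t ppn,
                                size_t proc_bytes, size_t node_bytes)
{
    return nnodes * ppn * proc_bytes + nnodes * node_bytes;
}
```

With made-up numbers (100 nodes, 8 ppn, 16 bytes of per-proc data, 64 bytes
of node-wide data), the first function gives 64000 bytes and the second
19200 bytes; the gap grows with ppn and with the share of data that is
node-wide.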