
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-01-10 13:28:21

On Jan 10, 2008, at 11:55 AM, Jon Mason wrote:

>> BTW, I should point out that the modex CPC string list stuff is
>> currently somewhat wasteful in the presence of multiple ports on a
>> host. This will definitely be bad at scale. Specifically, we'll
>> send
>> around a CPC string in the openib modex for *each* port. This may be
>> repetitive (and wasteful at scale), especially if you have more than
>> one port/NIC of the same type in each host. This can cause the modex
>> size to increase quite a bit.
> While the message sent via modex is now longer, the number of messages
> sent is the same. So I would argue that this is only slightly less
> optimal than the current implementation.

Not at scale.

Consider if someone has 2,000 8-core servers, each with a 2-port HCA.
Let's assume a full-machine run of 16,000 MPI processes, each of which
can use 2 ports. Let's assume non-ConnectX HCAs to be conservative, so
they'll all be able to use the oob CPC (someday soon, RDMA CM and IB CM
will also be available, but let's start small).

Each of the 16k MPI procs will have "oob"+sizeof(uint32_t) twice in
their modex for a grand total of 14 extra bytes. No big deal on an
individual message, but consider that that's 16,000 * 14 = 224,000
extra bytes being gathered to mpirun.
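The per-process arithmetic above can be sketched in C; the helper name
and exact framing of a modex entry are hypothetical, not Open MPI's
actual code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: extra modex bytes one process contributes for a
 * CPC string list -- one (name + uint32_t) entry per port.  With the
 * name "oob" (3 bytes) and a 4-byte uint32_t, that is 7 bytes per port,
 * or 14 bytes for a 2-port HCA, matching the figures in the text. */
static size_t cpc_modex_extra(const char *cpc_name, int nports)
{
    return (size_t)nports * (strlen(cpc_name) + sizeof(unsigned int));
}
```

For a 16,000-process job, the gather-side total is then simply
16,000 * cpc_modex_extra("oob", 2) = 224,000 extra bytes.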

Then consider that the whole pile of modex data is glommed together
and broadcast to each MPI process. Hence, we're now sending an extra
16,000 * 14 * 16,000 = 3,584,000,000 bytes across the network during
MPI_INIT (in addition to whatever is already being sent in the modex).

Ralph's work on the new ORTE branch will help this quite a bit (with
the routed oob stuff -- sending modex messages only once to each node,
vs. once to each process), but still, the numbers are large:

- gather phase: 16,000 * 14 = 224,000 extra bytes
- scatter phase: 16,000 * 14 * 2,000 = 448,000,000 extra bytes
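The gather and scatter totals in this message all follow one pattern; a
minimal C sketch, with hypothetical helper names (nprocs contributing
processes, ndest destinations -- processes in the non-routed case, hosts
in the per-host case):

```c
#include <stdint.h>

/* Gather: every process sends its extra bytes once to mpirun. */
static uint64_t gather_extra(uint64_t nprocs, uint64_t extra_per_proc)
{
    return nprocs * extra_per_proc;
}

/* Scatter: the glommed modex (nprocs * extra bytes of overhead) is
 * sent to each of ndest destinations. */
static uint64_t scatter_extra(uint64_t nprocs, uint64_t ndest,
                              uint64_t extra_per_proc)
{
    return nprocs * extra_per_proc * ndest;
}
```

With 14 extra bytes per process: scatter_extra(16000, 16000, 14) gives
the 3,584,000,000 non-routed figure, and scatter_extra(16000, 2000, 14)
gives 448,000,000 for per-host distribution.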

This is much more manageable, but still -- we should be careful when
we can.

Switching to hashed names and index lists would save quite a bit. For
example, if we do a dumb hash of the CPC name down to 1 byte and send
index lists of which ports use each CPC (each index can be 1 byte,
allowing a max of 256 ports in each host, which is probably sufficient
for the foreseeable future!), we're down to 3 extra bytes in the modex,
which is much more manageable:

in today's non-routed OOB:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 16,000 = 768,000,000 extra bytes

in the soon-to-be per-host modex distribution:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 2,000 = 96,000,000 extra bytes
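The hashed-name encoding described above might look something like the
following sketch; the XOR hash and function names are made up for
illustration, not a proposed wire format:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A "dumb" 1-byte hash of a CPC name (hypothetical: XOR of the bytes).
 * Any scheme works as long as all peers compute it identically. */
static uint8_t cpc_hash(const char *name)
{
    uint8_t h = 0;
    for (size_t i = 0; name[i] != '\0'; ++i)
        h ^= (uint8_t)name[i];
    return h;
}

/* Encode one CPC entry: a 1-byte name hash followed by a 1-byte index
 * per port using that CPC (so at most 256 ports per host).  Returns the
 * number of bytes written: 1 + nports. */
static size_t cpc_encode(uint8_t *buf, const char *name,
                         const uint8_t *port_idxs, size_t nports)
{
    buf[0] = cpc_hash(name);
    memcpy(buf + 1, port_idxs, nports);
    return 1 + nports;
}
```

For the 2-port "oob" case above, the entry is 1 + 2 = 3 bytes -- the
figure used in the gather/scatter totals.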

Additionally, the routed oob makes the reality even better than that,
because it uses a tree distribution for the modex. So although the
raw number of bytes is the same as a per-host-but-not-routed modex
distribution, the distribution is quite wide, potentially avoiding
network congestion (because different ports/links/servers are
involved, all in parallel).

Jeff Squyres
Cisco Systems