Thanks Jeff. It appears to me that the first approach to reducing modex data
makes the most sense and has the largest impact - I would advocate pursuing
it first. We can look at further refinements later.
Along that line, one thing we also exchange in the modex (not IB specific)
is hostname and arch. This is in the ompi/proc/proc.c code. It seems to me
that this is also wasteful and can be removed. The daemons already have that
info for the job and can easily "drop" it into each proc - there is no
reason to send it around.
I'll take a look at cleaning that up, ensuring we don't "break" daemonless
environments, along with the other things underway.
On 3/28/08 11:37 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
> I've had this conversation independently with several people now, so
> I'm sending it to the list rather than continuing to have the same
> conversation over and over. :-)
> As most of you know, Jon and I are working on the new openib
> "CPC" (connect pseudo-component) stuff in /tmp-public/openib-cpc2.
> There are two main reasons for it:
> 1. Add support for RDMA CM (they need it for iWarp support)
> 2. Add support for IB CM (which will hopefully be a more scalable
> connect system as compared to the current RML/OOB-based method of
> making IB QPs)
> When complete, there will be 4 CPCs: RDMA CM, IB CM, OOB, and XOOB
> (same as OOB but with ConnectX XRC extensions).
> RDMA CM has some known scaling issues, and at least some known
> workarounds -- I won't discuss the merits/drawbacks of RDMA CM here.
> IB CM has unknown scaling characteristics, but seems to look good on
> paper (e.g., it uses UD for a 3-way handshake to make an IB QP).
> On the trunk, it's a per-MPI process decision as to which CPC you'll
> use. Per ticket #1191, one of the goals of the /tmp-public branch is
> to make the CPC decision a per-openib-BTL-module decision, so that you
> can mix iWarp and IB hardware in a single host, for example. This fits in
> quite well with the "mpirun should work out of the box" philosophy of
> Open MPI.
> In the openib BTL, each BTL module is paired with a specific HCA/NIC
> (verbs) port. And depending on the interface hardware and software,
> one or more CPCs may be available for each. Hence, for each BTL
> module in each MPI process, we may send one or more CPC connect
> information blobs in the modex (note that the oob and xoob CPCs don't
> need to send anything additional in the modex).
> Jon and I are actually getting closer to completion on the branch, and
> it seems to be working.
> In conjunction with several other scalability discussions that are
> ongoing right now, several of us have toyed with two basic ideas to
> improve scalability of job launch / startup:
> 1. the possibility of eliminating the modex altogether (e.g., have
> ORTE dump enough information to each MPI process to figure out/
> calculate/locally lookup [in local files?] BTL addressing information
> for all peers in MPI_COMM_WORLD, etc.), a la Portals.
> 2. reducing the amount of data in the modex.
> One obvious idea for #2 is to have only one process on each host send
> all/the majority of openib BTL modex information for that host. The
> rationale here is that all MPI processes on a single host will share
> much of the same BTL addressing information, so why send it N times?
> Local rank 0 can modex send all/the majority of the modex for the
> openib BTL modules; local ranks 1-N can either send nothing or a
> [very] small piece of differentiating information (e.g., IBCM service
> This effectively makes the modex info for the openib BTL scale with
> the number of nodes, not the number of processes. This can be a big
> win in terms of overall modex size that needs to be both gathered and
> I worked up a spreadsheet showing the current size of the modex in the
> openib-cpc2 branch right now (using some "somewhat" contrived machine
> size/ppn/port combinations), and then compared it to the size after
> implementing the #2 idea shown above (see attached PDF).
> I also included a 3rd comparison for the case where Jon and I are able
> to reduce the CPC modex blob sizes -- we don't know yet if that'll work
> or not. But
> the numbers show that reducing the blobs by a few bytes clearly has
> [much] less of an impact than the "eliminating redundant modex
> information" idea, so we'll work on that one first.
> Additionally, reducing the modex size, paired with other ongoing ORTE
> scalability efforts, may obviate the need to eliminate the modex (at
> least for now...). Or, more specifically, efforts for eliminating the
> modex can be pushed to beyond v1.3.
> Of course, the same ideas can apply to other BTLs. We're only working
> on the openib BTL for now.
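
For a rough feel of why sending the openib modex data once per host
matters more than shaving bytes off each blob, here is a minimal Python
sketch of the size arithmetic. It is not taken from the spreadsheet or
from the Open MPI code: the function names, the 64-byte blob size, the
4-byte per-process differentiating piece, and the machine sizes are all
made-up assumptions for illustration only.

```python
def modex_bytes_per_process(nodes, ppn, ports, blob_bytes):
    """Current scheme: every process sends one blob per port."""
    return nodes * ppn * ports * blob_bytes


def modex_bytes_per_node(nodes, ppn, ports, blob_bytes, diff_bytes):
    """Proposed scheme: local rank 0 sends the full per-port blobs;
    each other local rank sends one small differentiating piece
    (assumed here to be one piece per process, not per port)."""
    full = nodes * ports * blob_bytes
    small = nodes * (ppn - 1) * diff_bytes
    return full + small


if __name__ == "__main__":
    BLOB = 64  # hypothetical bytes per port blob
    DIFF = 4   # hypothetical bytes of per-process differentiating info
    # Hypothetical (nodes, processes per node, ports per node) combinations.
    for nodes, ppn, ports in [(64, 4, 2), (1024, 8, 2), (4096, 16, 4)]:
        before = modex_bytes_per_process(nodes, ppn, ports, BLOB)
        after = modex_bytes_per_node(nodes, ppn, ports, BLOB, DIFF)
        print(f"{nodes:5d} nodes x {ppn:2d} ppn x {ports} ports: "
              f"{before:>10d} B -> {after:>9d} B")
```

Under these assumptions the first term grows with nodes * ppn while the
second grows mostly with nodes, which is the effect described in the
quoted message; the real numbers depend on the actual blob sizes.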