
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Scalability of openib modex
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-03-31 09:26:29


On Mar 31, 2008, at 9:22 AM, Ralph H Castain wrote:
> Thanks Jeff. It appears to me that the first approach to reducing
> modex data makes the most sense and has the largest impact - I would
> advocate pursuing it first. We can look at further refinements later.
>
> Along that line, one thing we also exchange in the modex (not IB
> specific) is hostname and arch. This is in the ompi/proc/proc.c code.
> It seems to me that this is also wasteful and can be removed. The
> daemons already have that info for the job and can easily "drop" it
> into each proc - there is no reason to send it around.
>
> I'll take a look at cleaning that up, ensuring we don't "break"
> daemonless environments, along with the other things underway.

Sounds perfect.
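For Ralph's hostname/arch point above, here's a rough C sketch of what
the proc-side lookup could look like if the daemon hands each local proc
a node map at launch. The OMPI_NODE_MAP variable name and the
"rank=hostname:arch" format are invented purely for illustration -- the
real mechanism would live in the ORTE/daemon code and in
ompi/proc/proc.c, not in a helper like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fill in hostname/arch for peer 'rank' from a locally provided map
 * instead of waiting for a modex entry from that peer. */
static int lookup_peer_from_nodemap(int rank, char *hostname, size_t len,
                                    unsigned long *arch)
{
    const char *map = getenv("OMPI_NODE_MAP");   /* hypothetical */
    if (NULL == map) {
        return -1;                /* fall back to the modex exchange */
    }
    char key[32];
    snprintf(key, sizeof(key), "%d=", rank);
    const char *entry = strstr(map, key);
    if (NULL == entry) {
        return -1;
    }
    entry += strlen(key);
    /* entry looks like "node37:0x00000101;..." in this sketch */
    const char *colon = strchr(entry, ':');
    if (NULL == colon || (size_t)(colon - entry) >= len) {
        return -1;
    }
    memcpy(hostname, entry, colon - entry);
    hostname[colon - entry] = '\0';
    *arch = strtoul(colon + 1, NULL, 0);
    return 0;
}

The point is just that the daemon already holds this data locally, so a
proc never has to receive it from its peers via the modex.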

>
> Ralph
>
>
>
> On 3/28/08 11:37 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
>> I've had this conversation independently with several people now, so
>> I'm sending it to the list rather than continuing to have the same
>> conversation over and over. :-)
>>
>> ------
>>
>> As most of you know, Jon and I are working on the new openib
>> "CPC" (connect pseudo-component) stuff in /tmp-public/openib-cpc2.
>> There are two main reasons for it:
>>
>> 1. Add support for RDMA CM (they need it for iWarp support)
>> 2. Add support for IB CM (which will hopefully be a more scalable
>> connect system as compared to the current RML/OOB-based method of
>> making IB QPs)
>>
>> When complete, there will be 4 CPCs: RDMA CM, IB CM, OOB, and XOOB
>> (same as OOB but with ConnectX XRC extensions).
>>
>> RDMA CM has some known scaling issues, and at least some known
>> workarounds -- I won't discuss the merits/drawbacks of RDMA CM here.
>> IB CM has unknown scaling characteristics, but seems to look good on
>> paper (e.g., it uses UD for a 3-way handshake to make an IB QP).
>>
>> On the trunk, it's a per-MPI-process decision as to which CPC you'll
>> use. Per ticket #1191, one of the goals of the /tmp-public branch is
>> to make the CPC decision a per-openib-BTL-module decision, so you can
>> mix iWarp and IB hardware in a single host, for example. This fits in
>> quite well with the "mpirun should work out of the box" philosophy of
>> Open MPI.
>>
>> In the openib BTL, each BTL module is paired with a specific HCA/NIC
>> (verbs) port. And depending on the interface hardware and software,
>> one or more CPCs may be available for each. Hence, for each BTL
>> module in each MPI process, we may send one or more CPC connect
>> information blobs in the modex (note that the oob and xoob CPCs don't
>> need to send anything additional in the modex).
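To make that concrete, here is a hypothetical layout for what one openib
BTL module might contribute to the modex; the type and field names are
invented for illustration and are not the actual wire format on the
branch:

#include <stdint.h>

typedef struct {
    uint8_t  cpc_type;      /* e.g. OOB, XOOB, IB CM, RDMA CM         */
    uint8_t  cpc_priority;  /* which CPC the receiver should prefer   */
    uint16_t blob_len;      /* length of the CPC-specific blob below  */
    /* followed by blob_len bytes of CPC-specific connect data
     * (empty for the oob/xoob CPCs)                                  */
} cpc_entry_t;

typedef struct {
    uint64_t subnet_prefix; /* identifies the IB subnet of this port  */
    uint16_t lid;           /* port LID (IB); unused for iWarp        */
    uint8_t  num_cpcs;      /* how many cpc_entry_t records follow    */
    /* followed by num_cpcs variable-length cpc_entry_t records       */
} btl_module_modex_t;

Each MPI process would send one such record per openib BTL module
(i.e., per verbs port) it is using.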
>>
>> Jon and I are actually getting closer to completion on the branch,
>> and it seems to be working.
>>
>> In conjunction with several other scalability discussions that are
>> ongoing right now, several of us have toyed with two basic ideas to
>> improve scalability of job launch / startup:
>>
>> 1. the possibility of eliminating the modex altogether (e.g., have
>> ORTE dump enough information to each MPI process to figure out/
>> calculate/locally lookup [in local files?] BTL addressing information
>> for all peers in MPI_COMM_WORLD, etc.), a la Portals.
>>
>> 2. reducing the amount of data in the modex.
>>
>> One obvious idea for #2 is to have only one process on each host send
>> all/the majority of openib BTL modex information for that host. The
>> rationale here is that all MPI processes on a single host will share
>> much of the same BTL addressing information, so why send it N times?
>> Local rank 0 can modex send all/the majority of the modex for the
>> openib BTL modules; local ranks 1-N can either send nothing or a
>> [very] small piece of differentiating information (e.g., IBCM service
>> ID).
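A minimal sketch of that split, assuming a generic
modex_send(key, buf, len) entry point and a local-rank query -- both are
placeholders, not the actual openib BTL or ORTE interfaces:

#include <stddef.h>
#include <stdint.h>

extern int modex_send(const char *key, const void *buf, size_t len); /* placeholder */
extern int my_local_rank(void);                                      /* placeholder */

static int send_openib_modex(const void *shared_blob, size_t shared_len,
                             uint32_t my_ibcm_service_id)
{
    if (0 == my_local_rank()) {
        /* Only one process per host publishes the bulky, host-wide
         * information (ports, subnets, CPC blobs). */
        return modex_send("openib.host", shared_blob, shared_len);
    }
    /* Everyone else publishes just the per-process differentiator,
     * e.g. an IB CM service ID -- a few bytes instead of the full blob. */
    return modex_send("openib.proc", &my_ibcm_service_id,
                      sizeof(my_ibcm_service_id));
}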
>>
>> This effectively makes the modex info for the openib BTL scale with
>> the number of nodes, not the number of processes. This can be a big
>> win in terms of overall modex size that needs to be both gathered and
>> bcast.
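To put rough numbers on it (purely illustrative, not taken from the
spreadsheet mentioned below): with 1,024 nodes, 8 processes per node,
and a ~256-byte per-process openib blob, today's modex carries
1024 * 8 * 256 bytes = 2 MB of openib data. With the local-rank-0
scheme it is roughly 1024 * 256 bytes of shared data plus
1024 * 7 * ~8 bytes of differentiators, i.e. about 0.3 MB -- the
per-process term no longer dominates.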
>>
>> I worked up a spreadsheet showing the current size of the modex in
>> the openib-cpc2 branch right now (using some "somewhat" contrived
>> machine size/ppn/port combinations), and then compared it to the size
>> after implementing the #2 idea shown above (see attached PDF).
>>
>> I also included a 3rd comparison for the case where Jon and I are
>> able to reduce the CPC modex blob sizes -- we don't know yet if
>> that'll work or not. But the numbers show that reducing the blobs by
>> a few bytes clearly has [much] less of an impact than the
>> "eliminating redundant modex information" idea, so we'll work on that
>> one first.
>>
>> Additionally, reducing the modex size, paired with other ongoing ORTE
>> scalability efforts, may obviate the need to eliminate the modex (at
>> least for now...). Or, more specifically, efforts for eliminating the
>> modex can be pushed to beyond v1.3.
>>
>> Of course, the same ideas can apply to other BTLs. We're only
>> working on the openib BTL for now.
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Jeff Squyres
Cisco Systems