Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IBCM error
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-07-15 07:30:58


On 7/15/08 5:05 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:

> On Jul 14, 2008, at 3:04 PM, Ralph H. Castain wrote:
>
>> I've been quietly following this discussion, but now feel a need to jump
>> in here. I really must disagree with the idea of building either IBCM or
>> RDMACM support by default. Neither of these has been proven to reliably
>> work, or to be advantageous. Our own experiences in testing them have
>> been slightly negative at best. When they did work, they were slower,
>> didn't scale well, and were unreliable.
>
> Minor clarification: we did not test RDMACM on RoadRunner.

Just for further clarification - I did, and it wasn't a particularly good
experience. Encountered several problems, none of them overwhelming, hence
my comments.

>
> We only tested IBCM at scale (not RDMACM) and ran into a variety of
> issues -- most of which were bugs in Open MPI's use of IBCM -- that
> culminated in the ib_cm_listen() problem. That problem is currently
> unsolved, and I agree that it unfortunately currently makes OMPI's
> IBCM support fairly useless. Bonk.
>
> IBCM was thought to be a nice thing: a cheap/fast way to make IB
> connections that would get OOB out of the picture. If the
> ib_cm_listen() problem is fixed, it may still be (Sean had an
> interesting suggestion; we'll see where it goes). But I totally agree
> that it is somewhat of an unknown quantity at this point. I also
> agree that the IBCM support in OMPI is not *necessary* because OOB
> works just fine (especially with the scalability improvements in v1.3).
>
> RDMACM, on the other hand, is *necessary* for iWARP connections. We
> know it won't scale well because of ARP issues, to which the iWARP
> vendors are publishing their own solutions (pre-populating ARP caches,
> etc.). Even when built and installed, RDMACM will not be used by
> default for IB hardware (you have to specifically ask for it). Since
> it's necessary for iWARP, I think we need to build and install it by
> default. Most importantly: production IB users won't be disturbed.

If it is necessary for iWARP, then fine - so long as it is only used if
specifically requested.

However, I would also ask that we be able to -not- build it upon request so
we can be certain a user doesn't attempt to use it by mistake ("gee, that
looks interesting - let Mikey try it!"). Ditto for ibcm support.

This way, we can experiment with it and continue to learn about the problems
without forcing our production people to deal with problem tickets because a
user tried something that has known problems.

>
>> I'm not trying to rain on anyone's parade. These are worthwhile in the
>> long term. However, they clearly need further work to be "ready for prime
>> time".
>>
>> Accordingly, I would recommend that they -only- be built if specifically
>> requested. Remember, most of our users just build blindly. It makes no
>> sense to have them build support for what can only be classed as an
>> experimental capability at this time.
>
> I defer to Mellanox for a decision about the IBCM CPC.
>
> But for the RDMACM, per above, I am still in favor of building and
> installing it by default.

Like I said, no problem - but give me a configure option so I can -not-
build it too.

>
>> Also, note that the OFED install is less-than-reliable wrt IBCM and RDMACM.
>
> True; the OFED install is less-than-reliable w.r.t. IBCM per the
> previously-discussed issue of not necessarily creating the
> /dev/infiniband/ucm* devices. There's a ticket open on the OpenFabrics
> bugzilla about it. I wish it would get fixed. :-)
>
> But I've not seen any problems with OFED's RDMACM installation.
>
> The only issue I've seen with RDMACM is when sites consciously chose
> to put the RDMACM libraries and/or modules on the head node (and
> therefore OMPI built support for it), but then chose not to put them out
> on back-end compute nodes. Keep in mind that this is *not* the
> default OFED installation pattern -- a human has to go manually modify
> a file to make it do that (I don't believe that there's even a menu
> option for that installation mode; you have to go hand-edit an OFED
> installation configuration file or simply choose not to install, or to
> remove, certain RPMs on back-end nodes).

Guess what - we don't always put them out there because - tada - we don't
use them! What goes out on the backend is a stripped-down set of the
libraries we require. Given the huge number of libraries people provide
(looking at the bigger, beyond-OMPI picture), it consumes a lot of limited
disk space to install every library on every node. So sometimes we build our
own RPMs to pick up only what we need.

As long as --without-rdmacm and --without-ibcm are available, we are happy.

>
>> We have spent considerable time chasing down installation problems
>> that allowed the system to build, but then caused it to crash-and-burn
>> if we attempted to use it. We have gained experience at knowing
>> when/where to look now, but that doesn't lessen the reputation impact
>> OMPI is getting as a "buggy, cantankerous beast" according to our sys
>> admins.
>
> Isn't the whole point of pre-release test versions to find and fix
> such bugs? ;-)

Tell that to a sys admin of a production system - better wear your helmet.

>
>> Not a reputation we should be encouraging.
>>
>> Turning this off by default allows those more adventurous souls to
>> explore this capability, while letting our production-oriented customers
>> install and run in peace.
>
>
> Pasha was recommending that IBCM be built by default *but not used by
> default*. So production users would still be able to run in peace --
> OOB will still be the default. I see it pretty much like SLURM
> support: it's built by default, but it won't activate itself unless
> relevant. But like I said above, I defer to Mellanox for IBCM. :-)

I can turn off building SLURM support - can I do the same with ibcm and
rdmacm? No - which is the crux of the problem.

Ralph

>
> Just my $0.00000000002...