Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [EXTERNAL] What to do about openib/ofacm/cpc (was: r29703 - in trunk: contrib/p...)
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-11-14 13:55:37


Here's the core problem: it isn't a question of "if" some of these things should be resolved, but "who". They've been around for a very long time, but nobody has the time or will to fix them. I have no access to such machines, so all I can do is verify that it sorta compiles and is consistent with the code base. I can't verify that it works, nor debug it.

Guess my point is that someone who cares needs to clean up the cpc vs ofacm problem and get whatever connection managers we want to support working. I removed the oob and xoob ones because (a) they don't work, and (b) I'm tired of repeatedly having to explain that to people.

On Nov 14, 2013, at 10:23 AM, Barrett, Brian W <bwbarre_at_[hidden]> wrote:

> On 11/14/13 11:16 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>
>> On Nov 14, 2013, at 1:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>
>>>> 1) What the status of UDCM is (does it work reliably, does it support
>>>> XRC, etc.)
>>>
>>> Seems to be working okay on the IB systems at LANL and IU. Don't know
>>> about XRC - I seem to recall the answer is "no"
>>
>> FWIW, I recall that when Cisco was testing UDCM (a long time ago --
>> before we threw away our IB gear...), we found bugs in UDCM that only
>> showed up with really large numbers of MTT tests running UDCM (i.e., 10K+
>> tests a night, especially with lots of UDCM-based jobs running
>> concurrently on the same cluster). These types of bugs didn't show up in
>> casual testing.
>>
>> Has that happened with the new/fixed UDCM? Cisco is no longer in a
>> position to test this.
>
> Neither are we at Sandia, unfortunately. I only have 16 nodes for nightly
> testing, and only 8 of those are always running Linux, so that doesn't
> help much on the stress test.
>
>>>> 2) What's the difference between CPCs and OFACM, and what are our plans
>>>> w.r.t. 1.7 there?
>>>
>>> Pasha created ofacm because some of the collective components now need
>>> to forge connections. So he created the common/ofacm code to meet those
>>> needs, with the intention of someday replacing the openib cpc's with the
>>> new common code. However, this was stalled by the iWARP issue, and so it
>>> fell off the table.
>>>
>>> We now have two duplicate ways of doing the same thing, but with code
>>> in two different places. :-(
>>
>> FWIW, the iWARP vendors have repeatedly been warned that ofacm is going
>> to take over, and unless they supply patches, iWARP will stop working in
>> Open MPI. I know for a fact that they are very aware of this.
>>
>> So my $0.02 is that ofacm should take over -- let's get rid of CPC and
>> have openib use the ofacm. The iWARP folks can play catch up if/when
>> they want to.
>>
>> Of course, I'm not in this part of the code base any more, so it's not
>> really my call -- just my $0.02...
>
> Of course, that doesn't help with the core issue; we can't have a
> regression w.r.t XRC support between 1.7.3 and 1.7.4. But I agree, I'm
> fine with only fixing this in one place.
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories