Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [EXTERNAL] What to do about openib/ofacm/cpc (was: r29703 - in trunk: contrib/p...)
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2013-11-14 13:23:35


On 11/14/13 11:16 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:

>On Nov 14, 2013, at 1:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
>>> 1) What the status of UDCM is (does it work reliably, does it support
>>> XRC, etc.)
>>
>> Seems to be working okay on the IB systems at LANL and IU. Don't know
>>about XRC - I seem to recall the answer is "no"
>
>FWIW, I recall that when Cisco was testing UDCM (a long time ago --
>before we threw away our IB gear...), we found bugs in UDCM that only
>showed up with really large numbers of MTT tests running UDCM (i.e., 10K+
>tests a night, especially with lots of UDCM-based jobs running
>concurrently on the same cluster). These types of bugs didn't show up in
>casual testing.
>
>Has that happened with the new/fixed UDCM? Cisco is no longer in a
>position to test this.

Neither are we at Sandia, unfortunately. I only have 16 nodes for nightly
testing, and only 8 of those are always running Linux, so that doesn't
help much on the stress test.

>>> 2) What's the difference between CPCs and OFACM and what's our plans
>>> w.r.t 1.7 there?
>>
>> Pasha created ofacm because some of the collective components now need
>>to forge connections. So he created the common/ofacm code to meet those
>>needs, with the intention of someday replacing the openib cpc's with the
>>new common code. However, this was stalled by the iWarp issue, and so it
>>fell off the table.
>>
>> We now have two duplicate ways of doing the same thing, but with code
>>in two different places. :-(
>
>FWIW, the iWARP vendors have repeatedly been warned that ofacm is going
>to take over, and unless they supply patches, iWarp will stop working in
>Open MPI. I know for a fact that they are very aware of this.
>
>So my $0.02 is that ofacm should take over -- let's get rid of CPC and
>have openib use the ofacm. The iWarp folks can play catch up if/when
>they want to.
>
>Of course, I'm not in this part of the code base any more, so it's not
>really my call -- just my $0.02...

Of course, that doesn't help with the core issue; we can't have a
regression w.r.t XRC support between 1.7.3 and 1.7.4. But I agree, I'm
fine with only fixing this in one place.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories