On 11/14/2013 12:16 PM, Jeff Squyres (jsquyres) wrote:
> On Nov 14, 2013, at 1:03 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>> 1) What the status of UDCM is (does it work reliably, does it support
>>> XRC, etc.)
>> Seems to be working okay on the IB systems at LANL and IU. Don't know about XRC - I seem to recall the answer is "no"
> FWIW, I recall that when Cisco was testing UDCM (a long time ago -- before we threw away our IB gear...), we found bugs in UDCM that only showed up with really large numbers of MTT tests running UDCM (i.e., 10K+ tests a night, especially with lots of UDCM-based jobs running concurrently on the same cluster). These types of bugs didn't show up in casual testing.
> Has that happened with the new/fixed UDCM? Cisco is no longer in a position to test this.
>>> 2) What's the difference between CPCs and OFACM, and what are our plans
>>> w.r.t. 1.7 there?
>> Pasha created ofacm because some of the collective components now need to forge connections. So he created the common/ofacm code to meet those needs, with the intention of someday replacing the openib cpc's with the new common code. However, this was stalled by the iWarp issue, and so it fell off the table.
Perhaps if Pasha or somebody else proficient in the OMPI code could help
out, then the iWARP CPC could be moved. W/O help from OMPI developers,
it's going to take me a very long time...
>> We now have two duplicate ways of doing the same thing, but with code in two different places. :-(
> FWIW, the iWARP vendors have repeatedly been warned that ofacm is going to take over, and unless they supply patches, iWarp will stop working in Open MPI. I know for a fact that they are very aware of this.
> So my $0.02 is that ofacm should take over -- let's get rid of CPC and have openib use the ofacm. The iWarp folks can play catch up if/when they want to.
> Of course, I'm not in this part of the code base any more, so it's not really my call -- just my $0.02...
Can't we leave the openib rdma CPC code as is until we can get the
rdmacm CPC moved into OFACM? What is the harm in that, exactly? I
mean, if no iWARP devices support these accelerated MPI collectives,
then leave the rdmacm CPC in the openib btl so we can at least support
iWARP via the openib BTL...