Here's the core problem: it isn't a question of "if" some of these things should be resolved, but "who". They've been around for a very long time, but nobody has the time or will to fix them. I have no access to such machines, so all I can do is verify that it sorta compiles and is consistent with the code base. I can't verify that it works, nor debug it.

Guess my point is that someone who cares needs to clean up the CPC vs. ofacm problem and get whatever connection managers we want to support working. I removed the oob and xoob ones because (a) they don't work, and (b) I'm tired of repeatedly having to explain that to people.


On Nov 14, 2013, at 10:23 AM, Barrett, Brian W <bwbarre@sandia.gov> wrote:

On 11/14/13 11:16 AM, "Jeff Squyres (jsquyres)" <jsquyres@cisco.com> wrote:

On Nov 14, 2013, at 1:03 PM, Ralph Castain <rhc@open-mpi.org> wrote:

1) What the status of UDCM is (does it work reliably, does it support
XRC, etc.)

Seems to be working okay on the IB systems at LANL and IU. Don't know
about XRC - I seem to recall the answer is "no".
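
(For context: UDCM does its connection handshake in-band, over UD queue
pairs, instead of going through the runtime's out-of-band channel the way
the oob CPC did. Conceptually the wire messages look something like the
sketch below -- hypothetical names, not the real structs; the actual
protocol lives in ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c.)

    /* Hypothetical sketch of a UD-based connection handshake message. */
    #include <stdint.h>

    enum udcm_msg_type { UDCM_REQUEST, UDCM_RESPONSE, UDCM_ACK };

    struct udcm_connect_msg {
        uint32_t rc_qp_num;  /* RC QP created by the sender for data traffic */
        uint16_t lid;        /* sender's port LID */
        uint8_t  type;       /* REQUEST / RESPONSE / ACK */
    };

    /* These messages travel over UD QPs, i.e. an unreliable transport:
     * requests can be dropped or duplicated, so the CM has to retransmit
     * and de-duplicate.  Races in that logic are exactly the kind of bug
     * that tends to show up only under heavy concurrent load. */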

FWIW, I recall that when Cisco was testing UDCM (a long time ago --
before we threw away our IB gear...), we found bugs in UDCM that only
showed up with really large numbers of MTT tests running UDCM (i.e., 10K+
tests a night, especially with lots of UDCM-based jobs running
concurrently on the same cluster).  These types of bugs didn't show up in
casual testing.

Has that happened with the new/fixed UDCM?  Cisco is no longer in a
position to test this.

Neither are we at Sandia, unfortunately.  I only have 16 nodes for nightly
testing, and only 8 of those are always running Linux, so that doesn't
help much on the stress test.

2) What's the difference between CPCs and OFACM, and what are our plans
w.r.t. 1.7 there?

Pasha created the common/ofacm code because some of the collective
components now need to forge connections themselves; the intention was to
someday replace the openib CPCs with the new common code. However, that
work was stalled by the iWARP issue, and so it fell off the table.

We now have two parallel implementations of the same thing, with the code
in two different places. :-(
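
(To make the duplication concrete: both the openib CPCs and common/ofacm
boil down to roughly the same function-pointer interface -- something like
the sketch below, with hypothetical names rather than the actual Open MPI
structs. Every fix therefore has to land in two places. At run time users
select among the openib CPCs with the btl_openib_cpc_include MCA
parameter, e.g. "--mca btl_openib_cpc_include udcm".)

    /* Hypothetical sketch of a connection-manager plugin interface; the
     * real (and near-duplicate) versions live under
     * ompi/mca/btl/openib/connect/ and ompi/mca/common/ofacm/. */
    #include <infiniband/verbs.h>

    struct conn_manager {
        const char *name;      /* e.g. "rdmacm", "udcm" */
        int         priority;  /* selection priority among usable CMs */

        /* Probe a device/port; succeed only if this CM can run here
         * (e.g. rdmacm needs an IP interface, xoob needs XRC). */
        int (*query)(struct ibv_context *device, int port);

        /* Wire up QPs to a remote endpoint; invoke the callback once the
         * connection is ready to carry traffic. */
        int (*start_connect)(void *endpoint,
                             void (*connected_cb)(void *endpoint));

        int (*finalize)(void);
    };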

FWIW, the iWARP vendors have repeatedly been warned that ofacm is going
to take over, and unless they supply patches, iWARP will stop working in
Open MPI.  I know for a fact that they are very aware of this.

So my $0.02 is that ofacm should take over -- let's get rid of the CPCs
and have openib use ofacm.  The iWARP folks can play catch-up if/when
they want to.

Of course, I'm not in this part of the code base any more, so it's not
really my call -- just my $0.02...

Of course, that doesn't help with the core issue; we can't have a
regression w.r.t. XRC support between 1.7.3 and 1.7.4.  But I agree -- I'm
fine with only fixing this in one place.

Brian

--
 Brian W. Barrett
 Scalable System Software Group
 Sandia National Laboratories




_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel