On Sun, Mar 16, 2014 at 08:19:32AM -0700, Ralph Castain wrote:
> On Mar 15, 2014, at 10:19 PM, Hjelm, Nathan T <hjelmn_at_[hidden]> wrote:
> > On Friday, March 14, 2014 8:48 PM, devel [devel-bounces_at_[hidden]] on behalf of Ralph Castain [rhc_at_[hidden]] wrote:
> >> To: Open MPI Developers
> >> Subject: [OMPI devel] 1.7.5 end-of-week status report
> >> Hi folks
> >> I have both good and bad news to report - first the good.
> >> OSHMEM now passes nearly all its tests on my Linux cluster (tcp). My hat is off to the Mellanox guys for getting this done, including getting our MTT repo tests complete.
> >> The MPI layer passes nearly all the IBM, Intel, and one-sided tests. Only a few failures.
> >> Now the bad. The coll/ml component continues to have problems, including segfaults, and I have discovered that the bcol and coll/ml code remains entangled (I thought it had been separated, but sadly not). I have therefore ompi_ignored coll/ml and bcol/ptpcoll.
> > No need. I discovered a bug in my last coll/ml fix. It incorrectly handled one of the possible hierarchies. The bug is fixed in trunk and a CMR is open for 1.7.5. In the future I will clean up this path but the fix should have us working again.
> I'm glad you were able to patch it, but this still begs the question of what to do with coll/ml. It's disturbing that its existence alone was enough to break the Java bindings (and yes, I concede those aren't built by default or part of the MPI standard) without even traversing its code path, and we've had a lot of problems with errors when we do go thru it. More disturbing, you can't even cleanly no-build that component due to the unfortunate cross-linkage with bcol/ptpcoll, so we definitely need a note in NEWS to warn people they need to no-build both.
I thought ORNL had addressed the cross-linkage as well. I am sure they
will get a fix for that in the next couple of days.
> It's unclear to me how to handle this situation, so we'll need to discuss it at the telecon. At the very least, I think we need to ensure coll/ml is not the default for 1.7.5 as it doesn't appear to be ready for that role.
coll/ml is not the default. The issue here is that we have to generate
and parse the topology at collective select time. This will happen even
if coll/ml is not the highest priority collective component. I fixed the
one issue with parsing the topology and then an issue with that
fix. To be clear, the original issue only occurred on OSX with debug
builds. This is a setup LANL (and I am sure ORNL) doesn't test.
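
To illustrate why a component's setup code can run even when it is not
the highest priority, here is a minimal sketch of a query-then-pick
selection loop. It is not Open MPI's actual selection code: the
component_t type, the function names, and the priority values are all
hypothetical, and ml_query() only stands in for the topology
generation and parsing step.

```c
#include <stddef.h>

/* Hypothetical component descriptor; not Open MPI's real MCA types. */
typedef struct {
    const char *name;
    int priority;          /* made-up values, for illustration only */
    int (*query)(void);    /* per-communicator setup; 0 on success */
} component_t;

/* Counts how often the stand-in "topology parse" runs. */
static int topo_parse_calls = 0;

/* Stand-in for a component that builds/parses a topology when queried. */
static int ml_query(void)    { topo_parse_calls++; return 0; }

/* Stand-in for a component with trivial setup. */
static int tuned_query(void) { return 0; }

/* Query every component, then return the usable one with the highest
 * priority. Each query runs before any priorities are compared, so a
 * bug in a losing component's query can still be hit. */
static const component_t *select_component(const component_t *c, size_t n)
{
    const component_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (c[i].query() != 0) {
            continue;      /* component declined; skip it */
        }
        if (best == NULL || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    return best;
}
```

With one component using ml_query at a lower priority and one using
tuned_query at a higher priority, select_component() returns the
latter, yet topo_parse_calls is still incremented once.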
I really didn't care about the Java problem but the fix was simple
enough. It is easy to verify that the code Jeff fixed was the only place
in coll/ml where a large buffer was put on the stack.
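
For reference, one common way to avoid a large stack buffer is to move
it to the heap. The sketch below is an assumption-laden illustration,
not the actual coll/ml change: SCRATCH_BYTES and fill_scratch() are
hypothetical names, and the size is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

#define SCRATCH_BYTES (1u << 20)  /* illustrative size, not from coll/ml */

/* Risky pattern: a buffer this large inside the stack frame, e.g.
 *     unsigned char scratch[SCRATCH_BYTES];
 * can overflow a small thread stack. The version below allocates the
 * same scratch space on the heap and frees it before returning. */
static int fill_scratch(unsigned char value, size_t *checksum)
{
    unsigned char *scratch = malloc(SCRATCH_BYTES);
    if (scratch == NULL) {
        return -1;                 /* allocation failed */
    }

    memset(scratch, value, SCRATCH_BYTES);

    /* Sum the bytes so the buffer is actually used. */
    size_t sum = 0;
    for (size_t i = 0; i < SCRATCH_BYTES; i++) {
        sum += scratch[i];
    }
    *checksum = sum;

    free(scratch);
    return 0;
}
```

The heap version costs a malloc/free per call, so in a hot path a
buffer allocated once and reused may be the better trade-off.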