Ok, I think we're all good to go now:
- Brad's problems were initially cluster config errors, and later we
determined that they *may* be eHCA gen 1 issues with RDMA CM. We're
deferring fixing them for sure until after v1.3 because IBM doesn't
care about RDMA CM for eHCA.
- Jon's issues *look* like MPI layer issues, not BTL connectivity
issues. And they were spurious. So we need to keep testing there.
However, I'm going to hold off on merging until after tomorrow's morning MTT
results because of the openib BTL breakage from today caused by the
ob1 commits yesterday. I'd like to get a good solid openib MTT test
night in before merging in all this new stuff.
On Oct 1, 2008, at 11:21 AM, Jon Mason wrote:
> On Wed, Oct 01, 2008 at 08:08:48AM -0400, Jeff Squyres wrote:
>> Per the call yesterday, I'll merge this into the trunk once I get it
>> working with Brad on PPC.
>> Brad and I discovered a missing htonl/ntohl somewhere in the code last
>> night right before I had to go offline (i.e., we can see the IP
>> addresses are backwards, but don't know where it's coming from) on
>> so I haven't finished yet. We'll probably get it fixed up today.
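
For anyone following along, here is a minimal sketch of how a missing
ntohl() makes an IPv4 address show up "backwards". It is not taken from
the RDMA CM CPC code (the thread does not say where the missing
conversion is); the function names are hypothetical, and it only
assumes the standard POSIX htonl/ntohl semantics.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/* Format a 32-bit IPv4 address held in HOST byte order as a dotted quad. */
static void format_host_order(uint32_t host_addr, char *buf, size_t len)
{
    snprintf(buf, len, "%u.%u.%u.%u",
             (unsigned)((host_addr >> 24) & 0xff),
             (unsigned)((host_addr >> 16) & 0xff),
             (unsigned)((host_addr >> 8) & 0xff),
             (unsigned)(host_addr & 0xff));
}

/* Correct (hypothetical helper): convert from network byte order first. */
static void format_addr_correct(uint32_t net_addr, char *buf, size_t len)
{
    format_host_order(ntohl(net_addr), buf, len);
}

/* Buggy (hypothetical helper): the ntohl() is missing, so a
 * network-order value is treated as if it were host order. */
static void format_addr_missing_ntohl(uint32_t net_addr, char *buf, size_t len)
{
    format_host_order(net_addr, buf, len);
}
```

Feeding both helpers the s_addr that inet_pton() produces for
"10.1.2.3", the correct one gives back 10.1.2.3 everywhere. The buggy
one gives 3.2.1.10 on a little-endian host, and the same string as the
correct one on a big-endian host, because ntohl() is a no-op there.
That is why this kind of bug can hide on one architecture and only show
up on another.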
> My tests yesterday showed some errors. Unfortunately, I lost the
> before I could take a look. I'll re-run them and verify that
> is still sane.
>> On Sep 30, 2008, at 10:05 AM, Jeff Squyres wrote:
>>> (putting this on devel just so that others can see it)
>>> Ok, I put in all the things in the RDMA CM CPC HG tree that we've
>>> talked about and it now should work out of the box with:
>>> - any iwarp (no need for kernel hacks to have initiator send first)
>>> - any IB (set up the stuff to do the initiator_depth and
>>> responder_resources properly)
>>> - any [valid but] bizarre IP addressing scheme
>>> Could everyone try the HG tree again to ensure it still/now works for
>>> you out of the box?
>>> Try with changeset 106 (b046bf97deab) or later. The only thing that
>>> is missing is a bit better scalability on allocating buffers for the
>>> CTS. Now that all the other changes are in, I'll be working on that
>>> today and tomorrow.
>>> Jeff Squyres
>>> Cisco Systems
>> Jeff Squyres
>> Cisco Systems
>> devel mailing list