I tried to test the latest hg tree, but it fails from time to time.
It happens on different machines with different errors (see attached file).
It also fails when ib0 is set to slave mode due to bonding, but I am sure that this happens "by design".
Annnnddd.... the pendulum swings back the other way now. :-)
See the ticket for details: https://svn.open-mpi.org/trac/ompi/ticket/1540
Short version: OMPI now just "figures it out" and does the right thing.
On Sep 28, 2008, at 7:27 AM, Jeff Squyres wrote:
Actually, I thought about this one more, and I have concluded that we do *not* want to do this (i.e., allow RDMA CM to send requests for port A from port B). If we did, it would be possible for *all* traffic to go the "wrong" way. More specifically, OMPI would not have direct control over what traffic goes over what port -- and that would be Bad.
So we'll still look up the peer based on the address where the connect request came from, and I'll eventually add a FAQ item about it (because IP addressing is much more flexible than IB addressing, and netadmins may be tempted to use a "flat" address space).
On Sep 26, 2008, at 5:53 PM, Jeff Squyres wrote:
On Sep 26, 2008, at 5:45 PM, Jeff Squyres wrote:
I actually spent all afternoon diagnosing something that I'll turn into a FAQ entry (OMPI's RDMA CM TCP addressing requirements are stronger than TCP's legal addressing rules). In short, OMPI needs the RDMA CM to guarantee that requests to connect to port A come from port A. If you have a "flat" network address space, RDMA CM may actually issue a connect request for port A from port B. This causes OMPI to get confused because it will not find the right BTL openib endpoint to connect to.
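To make the failure mode concrete, here is a minimal sketch in Python (not OMPI's actual C code). The `ENDPOINTS` table, the addresses, and `lookup_by_source` are all hypothetical names made up for illustration; the only thing taken from the description above is that the receiver picks the endpoint from the source address of the incoming connect request.

```python
# Hypothetical receiver-side table: one endpoint per peer port,
# keyed by the IP address assigned to that port (assumed addresses).
ENDPOINTS = {
    "192.168.1.10": "endpoint-A",  # peer's port A
    "192.168.1.11": "endpoint-B",  # peer's port B
}


def lookup_by_source(source_ip):
    """Pick the endpoint using only the source address of the connect request."""
    return ENDPOINTS.get(source_ip)


# Expected case: the request meant for port A really leaves from port A.
print(lookup_by_source("192.168.1.10"))

# "Flat" address space: the request meant for port A leaves from port B,
# so the receiver matches it to the wrong endpoint.
print(lookup_by_source("192.168.1.11"))

# A source address that matches no known port finds no endpoint at all.
print(lookup_by_source("192.168.1.99"))
```

In the second call the sender intended `endpoint-A`, but the receiver has no way to know that from the source address alone.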
And... crap. We can fix this one, too.
Right now, we use the IP address from the incoming RDMA CM event ID to determine who the caller is. But we could easily embed the IP address (i.e., endpoint designator) in the private data in the event so that the peer can look at *that* address to identify who the peer is (rather than the address embedded in the event ID).
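A rough sketch of that idea, again in Python rather than OMPI's C, and only under stated assumptions: the 4-byte IPv4 layout and the names `pack_private_data` and `identify_peer` are hypothetical, chosen for illustration, and do not reflect how the openib BTL encodes anything.

```python
import socket


def pack_private_data(endpoint_ip):
    """Sender side: encode the intended endpoint's IPv4 address as 4 bytes
    in network byte order (assumed layout, for illustration only)."""
    return socket.inet_aton(endpoint_ip)


def identify_peer(event_source_ip, private_data):
    """Receiver side: prefer the address carried in the private data;
    fall back to the event's source address if none was sent."""
    if len(private_data) >= 4:
        return socket.inet_ntoa(private_data[:4])
    return event_source_ip


# The request leaves from port B's address, but the private data names port A,
# so the receiver identifies the peer by port A's address.
payload = pack_private_data("192.168.1.10")
print(identify_peer("192.168.1.11", payload))
```

With this scheme the lookup no longer depends on which port the connect request happened to leave from.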
This is actually what the IB CM CPC does, IIRC.
Blah. This is also not hard, but it's another task for later. :-)