Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] rdma_connect() failure
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-10-05 14:45:59


Hi Jeff,

I tried to test the latest hg tree, but it fails from time to time.

It happens on different machines with different errors (see attached file).

It also fails when ib0 is set to slave mode due to bonding, but I am sure
that this happens "by design".

Lenny.

On 9/29/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> Annnnddd.... the pendulum swings back the other way now. :-)
>
> See the ticket for details: https://svn.open-mpi.org/trac/ompi/ticket/1540
>
> Short version: OMPI now just "figures it out" and does the right thing.
>
>
> On Sep 28, 2008, at 7:27 AM, Jeff Squyres wrote:
>
>> Actually, I thought about this one more, and I have concluded that we do
>> *not* want to do this (i.e., allow RDMA CM to send requests for port A from
>> port B). If we do this, then it would be possible that *all* traffic will go
>> the "wrong" way. More specifically, OMPI will not have direct control over
>> what traffic goes over what port -- and that would be Bad.
>>
>> So we'll still look up the peer based on the address where the connect
>> request came from, and I'll eventually add a FAQ item about it (because IP
>> addressing is much more flexible than IB addressing, and netadmins may be
>> tempted to use a "flat" address space).
>>
>>
>>
>> On Sep 26, 2008, at 5:53 PM, Jeff Squyres wrote:
>>
>> On Sep 26, 2008, at 5:45 PM, Jeff Squyres wrote:
>>>
>>> I actually spent all afternoon diagnosing something that I'll turn into
>>>> a FAQ entry (OMPI's RDMA CM TCP addressing requirements are stronger than
>>>> TCP's legal addressing rules). In short, OMPI needs the RDMA CM to
>>>> guarantee that requests to connect to port A come from port A. If you have
>>>> a "flat" network address space, RDMA CM may actually issue a connect request
>>>> for port A from port B. This causes OMPI to get confused because it will
>>>> not find the right BTL openib endpoint to connect to.
>>>>
>>>
>>>
>>> And... crap. We can fix this one, too.
>>>
>>> Right now, we use the IP address from the incoming RDMA CM event ID to
>>> determine who the caller is. But we could easily embed the IP address
>>> (i.e., endpoint designator) in the private data in the event so that the
>>> peer can look at *that* address to identify who the peer is (rather than the
>>> address embedded in the event ID).
>>>
>>> This is actually what the IB CM CPC does, IIRC.
>>>
>>> Blah. This is also not hard, but it's another task for later. :-)
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>>
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
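
For readers following the quoted Sep 26 discussion, here is a minimal sketch of the two ways to identify the caller: by the source address of the incoming connect request, or by an address that the sender embeds in the request's private data. This is not Open MPI's code and it does not call librdmacm. The names `struct endpoint`, `struct connect_private_data`, `find_endpoint`, and `identify_caller` are hypothetical, and addresses are plain 32-bit values for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical endpoint table entry: one per peer port we expect to hear from. */
struct endpoint {
    uint32_t peer_ipv4;   /* address of the peer port (illustrative value) */
    const char *name;     /* label for debugging */
};

/* Hypothetical payload the sender places in the connect request's private
 * data: the address of the port it wants to be identified as. */
struct connect_private_data {
    uint32_t intended_ipv4;
};

/* Linear search of the endpoint table by address; NULL if no match. */
static const struct endpoint *find_endpoint(const struct endpoint *eps,
                                            size_t n, uint32_t ipv4)
{
    assert(eps != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (eps[i].peer_ipv4 == ipv4) {
            return &eps[i];
        }
    }
    return NULL;
}

/* If private data is present, identify the caller by the embedded address;
 * otherwise fall back to the source address the request arrived from. */
static const struct endpoint *identify_caller(const struct endpoint *eps,
                                              size_t n, uint32_t src_ipv4,
                                              const struct connect_private_data *pd)
{
    uint32_t key = (pd != NULL) ? pd->intended_ipv4 : src_ipv4;
    return find_endpoint(eps, n, key);
}
```

With a "flat" address space, a request meant for port A can arrive with port B's source address. Looking up by source address alone then selects the wrong endpoint, while the embedded address still selects the intended one. Note that the later messages in the thread decided against relying on this and kept the lookup by source address.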