
Open MPI Development Mailing List Archives


From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-06-13 12:52:53


On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:

> I wonder if this is bringing up the point that there are several of
> us working in the openib code base -- I wonder if it would be
> worthwhile to have a [short] teleconference to discuss what we're all
> doing in openib, where we're doing it (trunk, branch, whatever), when
> we expect to have it done, what version we need it in, etc. Just a
> coordination kind of teleconference. If people think this is a good
> idea, I can set up the call.

sounds good to me.

- Galen

>
> For example, don't forget that Nysal and I have the openib btl
> port-selection stuff off in /tmp/jnysal-openib-wireup (the
> btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no one
> objects, I will be bringing that stuff into the trunk tomorrow evening
> (I'm pretty sure it won't conflict with what Galen is doing; Galen and
> I discussed on the phone this morning).
>
>
>
>
> On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:
>
>> Hi Gleb,
>>
>> As we have discussed before I am working on adding support for
>> multiple QPs with either per peer resources or shared resources.
>> As a result of this I am trying to clean up a lot of the OpenIB code.
>> It has grown up organically over the years and needs some attention.
>> Perhaps we can coordinate on commits or even work from the same temp
>> branch to do an overall cleanup as well as addressing the issue you
>> describe in this email.
>>
>> I bring this up because this commit will conflict quite a bit with
>> what I am working on. I can always merge it by hand, but it may make
>> sense for us to get this all done in one area and then bring it all
>> over?
>>
>> Thanks,
>>
>> Galen
>>
>>
>> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
>>
>>> Hello everyone,
>>>
>>> I encountered a problem with the openib on-demand connection code.
>>> Basically, it works only by pure luck if you have more than one
>>> endpoint for the same proc, and it sometimes breaks in mysterious ways.
>>>
>>> The algorithm works like this: A wants to connect to B, so it creates
>>> a QP and sends it to B. B receives the QP from A, looks for an
>>> endpoint that is not yet associated with a remote endpoint, creates a
>>> QP for it, and sends the info back. Now A receives the QP and goes
>>> through the same logic as B, i.e. it looks for an endpoint that is not
>>> yet connected, BUT there is no guarantee that it will find the
>>> endpoint that initiated the connection in the first place! And if it
>>> finds another one, it will create a QP for it and send it back to B,
>>> and so on and so forth. In the end I sometimes get a peculiar mesh of
>>> connections where no QP has a connection back to it from the peer
>>> process.
>>>
>>> To overcome this problem, B needs to send back some info that will
>>> allow A to determine the endpoint that initiated the connection
>>> request. The lid:qp pair will allow for this. But even then the
>>> problem will remain if two procs initiate a connection at the same
>>> time. To deal with simultaneous connections, an asymmetric protocol
>>> has to be used: one peer becomes the master, the other the slave. The
>>> slave always initiates a connection to the master. The master chooses
>>> a local endpoint to satisfy the incoming request and sends the info
>>> back to the slave. If the master wants to initiate a connection, it
>>> sends a message to the slave, and the slave initiates a connection
>>> back to the master.
>>>
>>> The included patch implements the algorithm described above and works
>>> for all scenarios for which the current code fails to create a
>>> connection.
>>>
>>> --
>>> Gleb.
>>> <fix_openib_wireup.diff>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
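
To make the wire-up scheme from Gleb's quoted message concrete, here is a minimal Python model of it. This is only a sketch under stated assumptions, not the code in fix_openib_wireup.diff: the `Endpoint` and `Proc` classes, the function names, and the rule that the lower rank acts as master are all hypothetical (the thread does not say how the master is chosen). It illustrates the two ideas from the message: the reply echoes the initiator's lid:qp pair so the slave can find the exact endpoint that started the handshake, and a master never initiates directly but asks the slave to connect to it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QpId = Tuple[int, int]  # (lid, qp number); stands in for the lid:qp pair


@dataclass
class Endpoint:
    lid: int
    qp_num: int
    remote: Optional[QpId] = None  # lid:qp of the peer endpoint once connected

    @property
    def qp_id(self) -> QpId:
        return (self.lid, self.qp_num)


@dataclass
class Proc:
    rank: int
    endpoints: List[Endpoint]

    def is_master_of(self, peer: "Proc") -> bool:
        # ASSUMPTION: the lower rank is the master. The thread does not
        # specify the rule; any rule both sides agree on would do.
        return self.rank < peer.rank


def first_free(proc: Proc) -> Endpoint:
    # Raises StopIteration if every endpoint is already connected.
    return next(e for e in proc.endpoints if e.remote is None)


def master_handle_request(master: Proc, initiator_id: QpId) -> Tuple[QpId, QpId]:
    """Master picks a free local endpoint and echoes the initiator's lid:qp."""
    local = first_free(master)
    local.remote = initiator_id
    return initiator_id, local.qp_id


def slave_connect(slave: Proc, master: Proc) -> None:
    """Slave initiates; the echoed lid:qp identifies the initiating endpoint."""
    initiator = first_free(slave)
    echoed_id, master_id = master_handle_request(master, initiator.qp_id)
    # Match on the echoed lid:qp instead of "any unconnected endpoint".
    target = next(e for e in slave.endpoints if e.qp_id == echoed_id)
    target.remote = master_id


def connect(initiator: Proc, peer: Proc) -> None:
    """Entry point for whichever side wants a new connection."""
    if initiator.is_master_of(peer):
        # Master only asks; the slave performs the actual initiation.
        slave_connect(peer, initiator)
    else:
        slave_connect(initiator, peer)


def links_are_symmetric(p: Proc, q: Proc) -> bool:
    """True if every connected endpoint is pointed back at by its peer."""
    by_id = {e.qp_id: e for e in p.endpoints + q.endpoints}
    return all(by_id[e.remote].remote == e.qp_id
               for e in by_id.values() if e.remote is not None)
```

For example, with `a = Proc(0, [Endpoint(1, 10), Endpoint(2, 11)])` and `b = Proc(1, [Endpoint(3, 20), Endpoint(4, 21)])`, calling `connect(b, a)` and then `connect(a, b)` goes through the slave-initiated path both times, and `links_are_symmetric(a, b)` holds afterwards. Because only one side ever initiates, the simultaneous-connection race described in the message cannot arise in this model.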