Open MPI Development Mailing List Archives

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-06-13 12:52:53


On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:

> I wonder if this is bringing up the point that there are several of
> us working in the openib code base -- I wonder if it would be
> worthwhile to have a [short] teleconference to discuss what we're all
> doing in openib, where we're doing it (trunk, branch, whatever), when
> we expect to have it done, what version we need it in, etc. Just a
> coordination kind of teleconference. If people think this is a good
> idea, I can set up the call.

sounds good to me.

- Galen

>
> For example, don't forget that Nysal and I have the openib btl
> port-selection stuff off in /tmp/jnysal-openib-wireup (the
> btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no one
> objects, I will be bringing that stuff in to the trunk tomorrow evening
> (I'm pretty sure it won't conflict with what Galen is doing; Galen and I
> discussed on the phone this morning).
>
>
>
>
> On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:
>
>> Hi Gleb,
>>
>> As we have discussed before I am working on adding support for
>> multiple QPs with either per peer resources or shared resources.
>> As a result of this I am trying to clean up a lot of the OpenIB code.
>> It has grown up organically over the years and needs some attention.
>> Perhaps we can coordinate on commits or even work from the same temp
>> branch to do an overall cleanup as well as addressing the issue you
>> describe in this email.
>>
>> I bring this up because this commit will conflict quite a bit with
>> what I am working on. I can always merge it by hand, but it may make
>> sense for us to get this all done in one area and then bring it all
>> over?
>>
>> Thanks,
>>
>> Galen
>>
>>
>> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
>>
>>> Hello everyone,
>>>
>>> I encountered a problem with the openib on-demand connection code.
>>> Basically, it works only by pure luck if you have more than one
>>> endpoint for the same proc, and it sometimes breaks in mysterious ways.
>>>
>>> The algorithm works like this: A wants to connect to B, so it creates
>>> a QP and sends it to B. B receives the QP from A, looks for an endpoint
>>> that is not yet associated with a remote endpoint, creates a QP for it,
>>> and sends the info back. Now A receives the QP and goes through the
>>> same logic as B, i.e. it looks for an endpoint that is not yet
>>> connected, BUT there is no guarantee that it will find the endpoint
>>> that initiated the connection in the first place! And if it finds
>>> another one, it will create a QP for it and send it back to B, and so
>>> on and so forth. In the end I sometimes get a peculiar mesh of
>>> connections where no QP has a connection back to it from the peer
>>> process.
>>>
>>> To overcome this problem, B needs to send back some info that will
>>> allow A to determine the endpoint that initiated the connection
>>> request. The lid:qp pair will allow for this. But even then the problem
>>> will remain if two procs initiate a connection at the same time. To
>>> deal with simultaneous connections, an asymmetric protocol has to be
>>> used: one peer becomes the master and the other the slave. The slave
>>> always initiates the connection to the master. The master chooses a
>>> local endpoint to satisfy the incoming request and sends the info back
>>> to the slave. If the master wants to initiate a connection, it sends a
>>> message to the slave, and the slave initiates a connection back to the
>>> master.
>>>
>>> The included patch implements the algorithm described above and works
>>> for all scenarios in which the current code fails to create a
>>> connection.
>>>
>>> --
>>> Gleb.
>>> <fix_openib_wireup.diff>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
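
The asymmetric wireup described in Gleb's message can be sketched as a small, single-process Python simulation. This is only an illustration under stated assumptions, not the code in fix_openib_wireup.diff: the names Proc, Endpoint, connect and is_symmetric are hypothetical, messages are modelled as plain function calls, and the rule "lower rank is master" is an assumption (the message does not say how the roles are chosen). The sketch shows the two ideas from the thread: the master echoes the initiator's lid:qp pair so the slave can find the endpoint that started the request, and only the slave ever initiates.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple

Address = Tuple[int, int]  # (lid, qp number), the "lid:qp pair" from the thread

_qp_numbers = itertools.count(100)  # hypothetical stand-in for real QP numbers


@dataclass
class Endpoint:
    lid: int
    qp: Optional[int] = None          # local QP number, once created
    remote: Optional[Address] = None  # peer's (lid, qp), once connected


class Proc:
    def __init__(self, rank: int, lids: List[int]):
        self.rank = rank
        self.endpoints = [Endpoint(lid) for lid in lids]

    def is_master(self, peer: "Proc") -> bool:
        # Assumption: the lower rank acts as master.
        return self.rank < peer.rank

    def free_endpoint(self) -> Endpoint:
        # First endpoint without a QP; raises StopIteration if none is left.
        return next(e for e in self.endpoints if e.qp is None)

    def create_qp(self, ep: Endpoint) -> Address:
        ep.qp = next(_qp_numbers)
        return (ep.lid, ep.qp)

    def accept(self, request: Address) -> Tuple[Address, Address]:
        # Master side: pick a local endpoint for the incoming request and
        # reply with our address plus an echo of the initiator's address.
        ep = self.free_endpoint()
        local = self.create_qp(ep)
        ep.remote = request
        return local, request

    def complete(self, echoed: Address, remote: Address) -> None:
        # Slave side: use the echoed lid:qp to find the initiating endpoint
        # instead of guessing "any endpoint that is not yet connected".
        ep = next(e for e in self.endpoints if (e.lid, e.qp) == echoed)
        ep.remote = remote


def _slave_initiate(slave: Proc, master: Proc, ep: Endpoint) -> None:
    request = slave.create_qp(ep)
    reply, echoed = master.accept(request)
    slave.complete(echoed, reply)


def connect(initiator: Proc, peer: Proc, ep: Optional[Endpoint] = None) -> None:
    if initiator.is_master(peer):
        # The master never initiates; it asks the slave to connect back.
        _slave_initiate(peer, initiator, peer.free_endpoint())
    else:
        chosen = ep if ep is not None else initiator.free_endpoint()
        _slave_initiate(initiator, peer, chosen)


def is_symmetric(p: Proc, q: Proc) -> bool:
    # True if every QP on p is connected to a QP on q that points back at it.
    by_addr = {(e.lid, e.qp): e for e in q.endpoints if e.qp is not None}
    for e in p.endpoints:
        if e.qp is None:
            continue
        peer_ep = by_addr.get(e.remote)
        if peer_ep is None or peer_ep.remote != (e.lid, e.qp):
            return False
    return True


if __name__ == "__main__":
    a, b = Proc(0, [1, 2]), Proc(1, [3, 4])
    connect(b, a)  # slave initiates directly
    connect(a, b)  # master asks the slave to initiate
    print(is_symmetric(a, b) and is_symmetric(b, a))
```

Because every request flows in one direction (slave to master), two peers that want to connect at the same moment cannot each create a QP and wait for the other to match it, which is the source of the mismatched mesh described in the message.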