On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:
> I wonder if this is bringing up the point that there are several of
> us working in the openib code base -- I wonder if it would be
> worthwhile to have a [short] teleconference to discuss what we're all
> doing in openib, where we're doing it (trunk, branch, whatever), when
> we expect to have it done, what version we need it in, etc. Just a
> coordination kind of teleconference. If people think this is a good
> idea, I can set up the call.
sounds good to me.
> For example, don't forget that Nysal and I have the openib btl
> port-selection stuff off in /tmp/jnysal-openib-wireup (the
> btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no one
> objects, I will be bringing that stuff in to the trunk tomorrow evening
> (I'm pretty sure it won't conflict with what Galen is doing; Galen and I
> discussed on the phone this morning).
> On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:
>> Hi Gleb,
>> As we have discussed before I am working on adding support for
>> multiple QPs with either per peer resources or shared resources.
>> As a result of this I am trying to clean up a lot of the OpenIB code.
>> It has grown up organically over the years and needs some attention.
>> Perhaps we can coordinate on commits or even work from the same temp
>> branch to do an overall cleanup as well as addressing the issue you
>> describe in this email.
>> I bring this up because this commit will conflict quite a bit with
>> what I am working on. I can always merge it by hand, but it may make
>> sense for us to get this all done in one area and then bring it all
>> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
>>> Hello everyone,
>>> I encountered a problem with the openib on-demand connection code:
>>> it works only by pure luck if you have more than one endpoint for
>>> the same proc, and it sometimes breaks in mysterious ways.
>>> The algo works like this: A wants to connect to B, so it creates a QP
>>> and sends it to B. B receives the QP from A, looks for an endpoint that
>>> is not yet associated with a remote endpoint, creates a QP for it, and
>>> sends the info back. Now A receives the QP and goes through the same
>>> logic as B, i.e. it looks for an endpoint that is not yet connected, BUT
>>> there is no guarantee that it will find the endpoint that initiated the
>>> connection in the first place! And if it finds another one, it will
>>> create a QP for it and send it back to B, and so on and so forth. In the
>>> end I sometimes get a peculiar mesh of connections where no QP has a
>>> connection back to it from the peer process.
>>> To overcome this problem, B needs to send back some info that will
>>> allow A to determine the endpoint that initiated the connection request.
>>> The lid:qp pair will allow for this. But even then the problem will
>>> remain if two procs initiate a connection at the same time. To deal with
>>> simultaneous connections, a protocol has to be used in which one peer
>>> becomes the master and the other the slave. The slave always initiates
>>> a connection to the master. The master chooses a local endpoint for the
>>> incoming request and sends the info back to the slave. If the master
>>> wants to initiate a connection, it sends a message to the slave, and the
>>> slave initiates a connection back to the master.
>>> The included patch implements the algorithm described above and works for
>>> scenarios in which the current code fails to create a connection.
>>> devel mailing list
> Jeff Squyres
> Cisco Systems
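For readers following the thread, the handshake Gleb describes can be sketched roughly as below. This is only an illustration in Python, not the patch itself and not the openib BTL code: the `Endpoint` class, the function names, and the message layout are all hypothetical. In particular, the thread does not say how the master is chosen, so the "lower process id is master" rule here is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

QpAddr = Tuple[int, int]  # (lid, qp number); hypothetical representation


@dataclass
class Endpoint:
    addr: QpAddr                     # local lid:qp pair of this endpoint
    remote: Optional[QpAddr] = None  # peer's lid:qp pair once connected


def is_master(local_id: int, remote_id: int) -> bool:
    # Assumption: roles are fixed by comparing process ids (lower id wins).
    return local_id < remote_id


def master_accept(endpoints: List[Endpoint], request_from: QpAddr) -> Dict[str, QpAddr]:
    """Master side: bind any free endpoint to the incoming request and
    echo the initiator's lid:qp pair in the reply."""
    for ep in endpoints:
        if ep.remote is None:
            ep.remote = request_from
            return {"initiator": request_from, "responder": ep.addr}
    raise RuntimeError("no free endpoint for incoming request")


def slave_complete(endpoints: List[Endpoint], reply: Dict[str, QpAddr]) -> Endpoint:
    """Slave side: use the echoed lid:qp pair to find the exact endpoint
    that sent the request, instead of picking any unconnected one."""
    for ep in endpoints:
        if ep.addr == reply["initiator"]:
            ep.remote = reply["responder"]
            return ep
    raise LookupError("reply does not match any local endpoint")
```

Because the reply carries the initiator's lid:qp pair, the slave never has to guess which of its endpoints started the exchange, and because only the slave ever sends the first QP, two simultaneous requests cannot cross.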