On Wed, Jun 13, 2007 at 09:38:21AM -0600, Galen Shipman wrote:
> Hi Gleb,
> As we have discussed before I am working on adding support for
> multiple QPs with either per peer resources or shared resources.
> As a result of this I am trying to clean up a lot of the OpenIB code.
> It has grown up organically over the years and needs some attention.
> Perhaps we can coordinate on commits or even work from the same temp
> branch to do an overall cleanup as well as addressing the issue you
> describe in this email.
> I bring this up because this commit will conflict quite a bit with
> what I am working on. I can always merge it by hand, but it may make
> sense for us to get this all done in one area and then bring it all
I am not committing this yet. I want people to review my logic and the
patch. If the change is OK with everyone who cares, then I want this
change to go into the 1.2 branch.
I don't care how this change gets to the trunk. I can use a patched
version for a while. If your branch is in a working state right now, I can
merge this change into it tomorrow.
> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
> > Hello everyone,
> > I encountered a problem with the openib on-demand connection code.
> > Basically, it works only by pure luck if you have more than one
> > endpoint for the same proc, and it sometimes breaks in mysterious ways.
> > The algorithm works like this: A wants to connect to B, so it creates a
> > QP and sends it to B. B receives the QP from A, looks for an endpoint
> > that is not yet associated with a remote endpoint, creates a QP for it
> > and sends the info back. Now A receives the QP and goes through the same
> > logic as B, i.e. it looks for an endpoint that is not yet connected, BUT
> > there is no guarantee that it will find the endpoint that initiated the
> > connection in the first place! If it finds another one, it will create
> > a QP for it and send it back to B, and so on and so forth. In the end I
> > sometimes get a peculiar mesh of connections in which no QP has a
> > connection back to it from the peer process.
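To make the failure concrete, here is a minimal toy model of the exchange
described above. It is only a sketch, not the openib code: the names
(Endpoint, first_unassociated, naive_wireup, reciprocated) are made up for
illustration, and it assumes that an endpoint which already owns a QP treats
an incoming message as the final reply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    name: str
    has_qp: bool = False          # a local QP has been created
    remote: Optional[str] = None  # name of the remote endpoint we recorded


def first_unassociated(endpoints):
    """Return the first endpoint with no remote endpoint recorded yet."""
    for ep in endpoints:
        if ep.remote is None:
            return ep
    return None


def naive_wireup(initiator_index, n_endpoints=2):
    """Simulate the exchange when A initiates from a[initiator_index]."""
    a = [Endpoint(f"a{i}") for i in range(n_endpoints)]
    b = [Endpoint(f"b{i}") for i in range(n_endpoints)]
    a[initiator_index].has_qp = True  # A creates a QP and sends it to B
    sender, receivers, others = a[initiator_index], b, a
    while True:
        # The receiving side picks whatever endpoint is still unassociated.
        ep = first_unassociated(receivers)
        if ep is None:
            break
        ep.remote = sender.name
        if ep.has_qp:
            break  # assumption: an endpoint with a QP takes this as the reply
        ep.has_qp = True  # otherwise create a QP and send it back
        sender, receivers, others = ep, others, receivers
    return {ep.name: ep.remote for ep in a + b}


def reciprocated(links):
    """Endpoints whose recorded peer also points back at them."""
    return [n for n, r in links.items() if r is not None and links.get(r) == n]
```

In this model, initiating from the first endpoint happens to work (a0 and b0
point at each other), while initiating from the second one leaves every
endpoint pointing at a peer that points somewhere else.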
> > To overcome this problem, B needs to send back some info that will
> > allow A to determine the endpoint that initiated the connection request.
> > The lid:qp pair will allow for this. But even then the problem remains
> > if two procs initiate a connection at the same time. To deal with
> > simultaneous connections, an asymmetric protocol has to be used: one
> > peer becomes the master and the other the slave. The slave always
> > initiates the connection to the master. The master chooses a local
> > endpoint to satisfy the incoming request and sends the info back to the
> > slave. If the master wants to initiate a connection, it sends a message
> > to the slave and the slave initiates a connection back to the master.
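The scheme above can be sketched roughly as follows. This is a toy model
under stated assumptions, not the attached patch: the Proc and Endpoint
classes and their methods are hypothetical, messages are modelled as direct
method calls, and the rule that the lower rank acts as master is my own
choice for the example (the description above does not say how roles are
assigned).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

QpId = Tuple[int, int]  # (lid, qp number), the pair echoed back in the reply


@dataclass
class Endpoint:
    lid: int
    qp_num: Optional[int] = None
    remote: Optional[QpId] = None


class Proc:
    def __init__(self, rank: int, lids: List[int]):
        self.rank = rank
        self.endpoints = [Endpoint(lid) for lid in lids]
        self._next_qp = 100 * (rank + 1)  # fake QP numbers for the model

    def _create_qp(self, ep: Endpoint) -> QpId:
        ep.qp_num = self._next_qp
        self._next_qp += 1
        return (ep.lid, ep.qp_num)

    def _first_free(self) -> Endpoint:
        return next(ep for ep in self.endpoints if ep.remote is None)

    def connect(self, peer: "Proc", ep_index: Optional[int] = None) -> None:
        if self.rank < peer.rank:
            # Master (assumed: lower rank): never initiate, ask the slave
            # to connect back instead.
            peer.connect(self)
            return
        # Slave: create a QP on the chosen endpoint and send it to the master.
        ep = self._first_free() if ep_index is None else self.endpoints[ep_index]
        local = self._create_qp(ep)
        remote, echoed = peer.handle_request(local)
        # The reply echoes our lid:qp pair, so we match the exact endpoint
        # that initiated the request instead of guessing.
        target = next(e for e in self.endpoints if (e.lid, e.qp_num) == echoed)
        target.remote = remote

    def handle_request(self, initiator: QpId) -> Tuple[QpId, QpId]:
        # Master: choose a local endpoint to satisfy the incoming request
        # and return its QP together with the initiator's lid:qp pair.
        ep = self._first_free()
        ep.remote = initiator
        return self._create_qp(ep), initiator
```

With a master Proc(0, [1, 2]) and a slave Proc(1, [3, 4]), calling
slave.connect(master, ep_index=1) pairs the slave's second endpoint with the
master's first one even though the slave's first endpoint is still free, and
a later master.connect(slave) is turned into a connection initiated by the
slave.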
> > The included patch implements the algorithm described above and works
> > for all scenarios in which the current code fails to create a connection.
> > --
> > Gleb.
> > <fix_openib_wireup.diff>
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel