Open MPI Development Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2007-06-13 13:40:20


On Wed, Jun 13, 2007 at 10:52:53AM -0600, Galen Shipman wrote:
>
> On Jun 13, 2007, at 10:48 AM, Jeff Squyres wrote:
>
> > I wonder if this is bringing up the point that there are several of
> > us working in the openib code base -- I wonder if it would be
> > worthwhile to have a [short] teleconference to discuss what we're all
> > doing in openib, where we're doing it (trunk, branch, whatever), when
> > we expect to have it done, what version we need it in, etc. Just a
> > coordination kind of teleconference. If people think this is a good
> > idea, I can set up the call.
>
> sounds good to me.
Sounds good to me too. Pasha is also working on the async event thread.
This patch is not something I had planned to work on, but the problem
prevented me from testing my changes to OB1 and is serious enough to be
fixed on v1.2.

>
> - Galen
>
> >
> > For example, don't forget that Nysal and I have the openib btl
> > port-selection stuff off in /tmp/jnysal-openib-wireup (the
> > btl_openib_if_[in|ex]clude MCA params). Per my prior e-mail, if no
> > one objects, I will be bringing that stuff in to the trunk tomorrow
> > evening (I'm pretty sure it won't conflict with what Galen is doing;
> > Galen and I discussed on the phone this morning).
> >
> >
> >
> >
> > On Jun 13, 2007, at 11:38 AM, Galen Shipman wrote:
> >
> >> Hi Gleb,
> >>
> >> As we have discussed before I am working on adding support for
> >> multiple QPs with either per peer resources or shared resources.
> >> As a result of this I am trying to clean up a lot of the OpenIB code.
> >> It has grown up organically over the years and needs some attention.
> >> Perhaps we can coordinate on commits or even work from the same temp
> >> branch to do an overall cleanup as well as addressing the issue you
> >> describe in this email.
> >>
> >> I bring this up because this commit will conflict quite a bit with
> >> what I am working on. I can always merge it by hand, but it may make
> >> sense for us to get this all done in one area and then bring it all
> >> over?
> >>
> >> Thanks,
> >>
> >> Galen
> >>
> >>
> >> On Jun 13, 2007, at 7:27 AM, Gleb Natapov wrote:
> >>
> >>> Hello everyone,
> >>>
> >>> I encountered a problem with the openib on-demand connection code.
> >>> Basically, it works only by pure luck if you have more than one
> >>> endpoint for the same proc, and it sometimes breaks in mysterious
> >>> ways.
> >>>
> >>> The algorithm works like this: A wants to connect to B, so it creates
> >>> a QP and sends it to B. B receives the QP from A, looks for an
> >>> endpoint that is not yet associated with a remote endpoint, creates a
> >>> QP for it and sends the info back. Now A receives the QP and goes
> >>> through the same logic as B, i.e. it looks for an endpoint that is not
> >>> yet connected, BUT there is no guarantee that it will find the
> >>> endpoint that initiated the connection in the first place! If it
> >>> finds another one, it will create a QP for it and send it back to B,
> >>> and so on and so forth. In the end I sometimes get a peculiar mesh of
> >>> connections where no QP has a connection back to it from the peer
> >>> process.
> >>>
> >>> To overcome this problem, B needs to send back some info that will
> >>> allow A to determine which endpoint initiated the connection request.
> >>> The lid:qp pair will allow for this. But even then the problem will
> >>> remain if two procs initiate a connection at the same time. To deal
> >>> with simultaneous connections, an asymmetric protocol has to be used:
> >>> one peer becomes the master and the other the slave. The slave always
> >>> initiates the connection to the master. The master chooses a local
> >>> endpoint to satisfy the incoming request and sends the info back to
> >>> the slave. If the master wants to initiate a connection, it sends a
> >>> message to the slave, and the slave initiates a connection back to
> >>> the master.
> >>>
> >>> The included patch implements the algorithm described above and
> >>> works for all scenarios in which the current code fails to create a
> >>> connection.
> >>>
> >>> --
> >>> Gleb.
> >>> <fix_openib_wireup.diff>
> >>> _______________________________________________
> >>> devel mailing list
> >>> devel_at_[hidden]
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
>

--
			Gleb.
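
The asymmetric wire-up described in the quoted message can be sketched as a
small simulation. This is only an illustration of the idea, not the contents
of fix_openib_wireup.diff: it is written in Python rather than C, message
passing is modelled as direct method calls, and every name in it (Proc,
Endpoint, handle_request, fully_reciprocal) is hypothetical. It also assumes
that the lower rank acts as master; the message does not say how the roles
are assigned.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Addr = Tuple[int, int]  # (lid, qp number); stands in for the lid:qp pair


@dataclass
class Endpoint:
    lid: int
    qp: Optional[int] = None       # local QP number, set once a QP is created
    remote: Optional[Addr] = None  # peer's lid:qp, set once wired up

    def addr(self) -> Addr:
        return (self.lid, self.qp)


class Proc:
    def __init__(self, rank: int, lids: List[int]):
        self.rank = rank
        self.endpoints = [Endpoint(lid) for lid in lids]
        self._qp_counter = rank * 100  # hypothetical QP numbering scheme

    def _create_qp(self, ep: Endpoint) -> Addr:
        self._qp_counter += 1
        ep.qp = self._qp_counter
        return ep.addr()

    def _free_endpoint(self) -> Endpoint:
        return next(ep for ep in self.endpoints if ep.qp is None)

    def is_master_of(self, peer: "Proc") -> bool:
        # Assumption: lower rank is master.
        return self.rank < peer.rank

    def connect(self, peer: "Proc", ep: Optional[Endpoint] = None) -> None:
        if self.is_master_of(peer):
            # The master never initiates: it asks the slave to connect back.
            peer.connect(self)
            return
        if ep is None:
            ep = self._free_endpoint()
        request = self._create_qp(ep)
        reply, echoed = peer.handle_request(request)
        # The echoed lid:qp identifies the endpoint that started the request,
        # so the reply is never attached to some other unconnected endpoint.
        initiator = next(e for e in self.endpoints
                         if e.qp is not None and e.addr() == echoed)
        initiator.remote = reply

    def handle_request(self, request: Addr) -> Tuple[Addr, Addr]:
        # Master side: pick a local endpoint, create its QP, and send back
        # its address together with the initiator's own lid:qp.
        ep = self._free_endpoint()
        reply = self._create_qp(ep)
        ep.remote = request
        return reply, request


def fully_reciprocal(a: Proc, b: Proc) -> bool:
    """True if every QP on either side has a connection back from its peer."""
    def one_way(x: Proc, y: Proc) -> bool:
        peers = {e.addr(): e for e in y.endpoints if e.qp is not None}
        return all(e.remote in peers and peers[e.remote].remote == e.addr()
                   for e in x.endpoints if e.qp is not None)
    return one_way(a, b) and one_way(b, a)


a = Proc(0, lids=[1, 2])      # master under the lower-rank assumption
b = Proc(1, lids=[3, 4])      # slave
b.connect(a, b.endpoints[1])  # slave initiates from its second endpoint
a.connect(b)                  # master asks the slave to connect back
print(fully_reciprocal(a, b))
```

Because only the slave ever creates the first QP of a pair, two requests for
the same pair of procs cannot cross each other, and the echoed lid:qp removes
the guesswork on the initiating side.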