Open MPI User's Mailing List Archives

From: Gleb Natapov (glebn_at_[hidden])
Date: 2006-12-06 02:05:52


On Wed, Dec 06, 2006 at 12:14:35PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
> Actually I was wondering why there is a facility for having multiple
> LIDs for the same port. This led me to the entire series of questions.
> It is still not very clear to me what the advantage is of
> assigning multiple LIDs to the same port. Does it give some
> performance advantage?
Ideally, each LID has its own path through the fabric, so using several
LIDs is a way to reduce congestion.
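
To make the multiple-LID idea concrete: the number of LIDs on a port is set
by the port's LMC (LID Mask Control) value. A port with LMC = n answers to
2^n consecutive LIDs starting at its base LID, and the subnet manager can
route each of them differently. Below is a minimal sketch of that arithmetic.
The helper name port_lids and the sample values are made up for illustration;
this is not Open MPI code.

```python
def port_lids(base_lid, lmc):
    """Return every LID a port answers to, given its base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit field, so it must be in 0..7")
    count = 1 << lmc  # 2**lmc LIDs per port
    if base_lid % count != 0:
        raise ValueError("the low LMC bits of the base LID must be zero")
    return [base_lid + i for i in range(count)]


# Illustrative values only: base LID 0x10 with LMC 2 gives four LIDs.
print([hex(lid) for lid in port_lids(0x10, 2)])
```

With LMC 0 the port has a single LID, which is the case where there is
nothing to round-robin over.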

> -Chev
>
>
> On 12/5/06, Jeff Squyres <jsquyres_at_[hidden]> wrote:
> > There are two distinct layers of software being discussed here:
> >
> > - the PML (basically the back-end to MPI_SEND and friends)
> > - the BTL (byte transfer layer, the back-end bit movers for the ob1
> > and dr PMLs -- this distinction is important because there is nothing
> > in the PML design that forces the use of BTL's; indeed, there is at
> > least one current PML that does not use BTL's as the back-end bit
> > mover [the cm PML])
> >
> > The ob1 and dr PMLs know nothing about how the back-end bitmovers
> > work (BTL components) -- the BTLs are given considerable freedom to
> > operate within their specific interface contracts.
> >
> > Generally, ob1/dr queries each BTL component when Open MPI starts
> > up. Each BTL responds with whether it wants to run or not. If it
> > does, it gives back the one or more modules (think of a module as an
> > "instance" of a component). Typically, these modules correspond to
> > multiple NICs / HCAs / network endpoints. For example, if you have 2
> > ethernet cards, the tcp BTL will create and return 2 modules. ob1 /
> > dr will treat these as two paths to send data (reachability is
> > computed as well, of course -- ob1/dr will only send data down btls
> > for which the target peer is reachable). In general, ob1/dr will
> > round-robin across all available BTL modules when sending large
> > messages (as Gleb has described). See http://www.open-mpi.org/papers/
> > euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/
> > dr protocols.
> >
> > The openib BTL can return multiple modules if multiple LIDs are
> > available. So the ob1/dr doesn't know that these are not physical
> > devices -- it just treats each module as an equivalent mechanism to
> > send data.
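
As a rough illustration of the scheme Jeff describes, here is a toy model of
round-robin over modules. It is only a sketch under simplifying assumptions:
the function name, the module names and the fragment size are hypothetical,
and the real ob1 scheduling logic is not modeled here. The point is that the
scheduler sees only a list of modules and has no idea that two of them are
LIDs of the same port.

```python
from itertools import cycle


def schedule_fragments(message_len, frag_size, modules):
    """Assign each fragment of a message to a module in round-robin order.

    modules is a list of (name, reachable) pairs; unreachable modules are
    skipped, mirroring the reachability check described above.
    """
    usable = [name for name, reachable in modules if reachable]
    if not usable:
        raise RuntimeError("no module can reach the peer")
    plan = []
    next_module = cycle(usable)
    for offset in range(0, message_len, frag_size):
        length = min(frag_size, message_len - offset)
        plan.append((next(next_module), offset, length))
    return plan


# Hypothetical setup: one HCA port exposed as two modules (one per LID),
# plus a tcp module that cannot reach this peer.
modules = [("openib-lid0", True), ("openib-lid1", True), ("tcp-eth0", False)]
for module, offset, length in schedule_fragments(1_000_000, 400_000, modules):
    print(module, offset, length)
```

In this toy run the three fragments alternate between the two LID modules and
the unreachable tcp module is never used.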
> >
> > This is actually somewhat lame as a scheme, and we talked internally
> > about doing something more intelligent. But we decided to hold off
> > and let people (like you!) with real-world apps and networks give
> > this stuff a try and see what really works (and what doesn't work)
> > before trying to implement anything else.
> >
> > So -- all that explanation aside -- we'd love to hear your feedback
> > with regards to the multi-LID stuff in Open MPI. :-)
> >
> >
> >
> > On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:
> >
> > > Thanks for that.
> > >
> > > Suppose there are multiple interconnects, say Ethernet and
> > > InfiniBand, and a million bytes of data are to be sent. In this
> > > case the data will be sent through InfiniBand, since it is the fast
> > > path (please correct me if I am wrong).
> > >
> > > If there are multiple such sends, do you mean that each send
> > > will go through a different BTL in a RR manner if they are connected
> > > to the same port?
> > >
> > > -chev
> > >
> > >
> > > On 12/4/06, Gleb Natapov <glebn_at_[hidden]> wrote:
> > >> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
> > >> Chevchenkovic wrote:
> > >>> Hi,
> > >>> It is not clear from the code you mentioned in
> > >>> ompi/mca/pml/ob1/ where exactly the selection of the BTL bound to a
> > >>> particular LID occurs. Could you please specify the file/function
> > >>> name?
> > >> There is no such code there. OB1 knows nothing about LIDs. It does RR
> > >> over all available interconnects. It can do RR between Ethernet, IB
> > >> and Myrinet, for instance. The BTL presents each LID as a different
> > >> virtual HCA to OB1, and OB1 does round-robin between them without
> > >> even knowing that this is the same port of the same HCA.
> > >>
> > >> Can you explain what you are trying to achieve?
> > >>
> > >>> -chev
> > >>>
> > >>>
> > >>> On 12/4/06, Gleb Natapov <glebn_at_[hidden]> wrote:
> > >>>> On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic
> > >>>> Chevchenkovic wrote:
> > >>>>> Also, could you please tell me which part of the Open MPI code
> > >>>>> needs to be touched so that I can modify it to incorporate
> > >>>>> changes regarding LID selection?
> > >>>>>
> > >>>> It depends on what you want to do. The part that does RR over all
> > >>>> available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but that code
> > >>>> isn't aware that it is doing RR over different LIDs and not
> > >>>> different NICs (yet?).
> > >>>>
> > >>>> The code that controls what LIDs will be used is in
> > >>>> ompi/mca/btl/openib/btl_openib_component.c.
> > >>>>
> > >>>>> On 12/4/06, Chevchenkovic Chevchenkovic
> > >>>>> <chevchenkovic_at_[hidden]> wrote:
> > >>>>>> Is it possible to control the LID on which the sends and recvs
> > >>>>>> are posted, on either end?
> > >>>>>>
> > >>>>>> On 12/3/06, Gleb Natapov <glebn_at_[hidden]> wrote:
> > >>>>>>> On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic
> > >>>>>>> Chevchenkovic wrote:
> > >>>>>>>> Hi,
> > >>>>>>>> I had this query. I hope some expert replies to it.
> > >>>>>>>> I have 2 nodes connected point-to-point using an InfiniBand
> > >>>>>>>> cable. There are multiple LIDs for each of the end node ports.
> > >>>>>>>> When I issue an MPI_Send, are the sends posted on different
> > >>>>>>>> LIDs on each of the end nodes, or are they posted on the same
> > >>>>>>>> LID?
> > >>>>>>>> Awaiting your reply,
> > >>>>>>> It depends on what version of Open MPI you are using. If you
> > >>>>>>> are using trunk or the v1.2 beta, then all available LIDs are
> > >>>>>>> used in RR fashion. Earlier versions don't support LMC.
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Gleb.
> > >>>>>>> _______________________________________________
> > >>>>>>> users mailing list
> > >>>>>>> users_at_[hidden]
> > >>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>>> --
> > >>>> Gleb.
> > >>
> > >> --
> > >> Gleb.
> >
> >
> > --
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
> >

--
			Gleb.