Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] BTL add procs errors
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-05-28 15:00:16


To that point, where exactly in the openib BTL init / query sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error?
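
In other words, is the fix just an error path along these lines? (A hypothetical sketch with made-up names, not the actual openib code:)

    #include <stdlib.h>

    struct btl_module {
        void *channel;      /* stand-in for device / CQ setup  */
        void *endpoints;    /* stand-in for the endpoint table */
    };

    struct btl_module *btl_init(void)
    {
        struct btl_module *m = calloc(1, sizeof(*m));
        if (NULL == m) {
            return NULL;
        }
        m->channel   = malloc(64);
        m->endpoints = malloc(64);
        if (NULL == m->channel || NULL == m->endpoints) {
            /* tidy up everything allocated so far before reporting
             * failure, so the upper layer can cleanly skip this BTL */
            free(m->channel);
            free(m->endpoints);
            free(m);
            return NULL;
        }
        return m;
    }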

On May 28, 2010, at 2:21 PM, George Bosilca wrote:

> On May 28, 2010, at 10:03, Sylvain Jeaugey wrote:
>
> > On Fri, 28 May 2010, Jeff Squyres wrote:
> >
> >> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
> >>
> >>> Understood, and I agree that the bug should be fixed. Patches would be welcome. :-)
> > I sent a patch for the bml layer in my first e-mail. We will apply it to our tree, but as always we are trying to send patches back to open source (it was not my intent to start such a debate).
>
> The only problem with your patch is that it solves something that is not supposed to happen. As a proof of concept I returned errors from the tcp and sm BTLs, and Open MPI gracefully dealt with them. So it is not a matter of aborting that we're looking at; it is a matter of the openib BTL doing something it is not supposed to do.
>
> Going through the code, it looks like the bitmask doesn't matter: if an error is returned by a BTL, we zero the bitmask and continue to the next BTL.
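>
> A paraphrased sketch of that loop (toy types and made-up names, not the actual bml code):
>
>     #include <stdint.h>
>
>     typedef struct {
>         /* returns 0 on success; sets one bit per reachable peer */
>         int (*add_procs)(uint32_t *bits);
>     } btl_t;
>
>     int build_reachability(btl_t *btls, int nbtls, uint32_t *reachable)
>     {
>         *reachable = 0;
>         for (int i = 0; i < nbtls; ++i) {
>             uint32_t bits = 0;
>             if (0 != btls[i].add_procs(&bits)) {
>                 bits = 0;    /* zero whatever this BTL claimed... */
>                 continue;    /* ...and move on to the next BTL    */
>             }
>             *reachable |= bits;
>         }
>         /* a peer left with no bit set is what later triggers the
>          * "unable to reach" abort shown below */
>         return (0 == *reachable) ? -1 : 0;
>     }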
>
> Example: the SM BTL returns OMPI_ERROR after creating all the internal structures.
>
> >> mpirun -np 4 --host node01 --mca btl sm,self ./ring
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[22047,1],3]) is on host: node01
> Process 2 ([[22047,1],0]) is on host: node01
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> Now if I allow TCP on the node:
> >> mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring
>
> Process 0 sending 10 to 1, tag 201 (4 procs in ring)
> Process 0 sent to 1
> Process 3 exiting
> Process 0 decremented num: 9
> Process 0 decremented num: 8
>
> Thus, Open MPI does the right thing when the BTLs play by the rules.
>
> george.
>
> >
> >> I should clarify rather than being flip:
> >>
> >> 1. I agree: the bug should be fixed. Clearly, we should never crash.
> >>
> >> 2. After the bug is fixed, there is clearly a choice: some people may want to use a different transport if a given BTL is unavailable. Others may want to abort. Once the bug is fixed, this seems like a pretty straightforward thing to add.
> > If you use my patch, you still have no choice: errors on BTLs lead to an immediate stop instead of trying to continue (and crashing).
> > If someone wants to go further on this, then that's great. If nobody does, I think you should take my patch. Maybe it's not the best solution, but it's still better than the current state.
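> >
> > For illustration, the patch's behavior amounts to roughly this (a paraphrased sketch with made-up names, not the actual diff):
> >
> >     #include <stdint.h>
> >
> >     typedef struct { int (*add_procs)(uint32_t *bits); } btl_t;
> >
> >     int build_reachability(btl_t *btls, int nbtls, uint32_t *reachable)
> >     {
> >         *reachable = 0;
> >         for (int i = 0; i < nbtls; ++i) {
> >             uint32_t bits = 0;
> >             int rc = btls[i].add_procs(&bits);
> >             if (0 != rc) {
> >                 return rc;   /* fail fast instead of running on
> >                               * with a half-initialized BTL */
> >             }
> >             *reachable |= bits;
> >         }
> >         return 0;
> >     }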
> >
> > Sylvain

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/