In my case, the error happens in:
ibv_create_cq() returns an error which propagates up to
mca_btl_openib_add_procs(). As George mentioned, the openib BTL should be
completely ignored, since the bitmask is not taken into account when an
error is returned. However (I don't know why) openib gets called again and
So, yes, there must be a bug in openib.
I know this is how you guys designed the BML layer, but I was hoping
we could change the design to improve error handling.
So, this is my last try to explain my opinion. If you disagree, then we'll
fix this on the openib side.
Ignoring BTL errors bugs me because the current errors are all serious.
Our attempt to continue will therefore always lead to a crash (George, you
introduced an error return code, not a real error, hence you managed to
continue). This confuses the user as to why we have a problem, because the
first serious error will be buried under further errors or crashes. This is
true for openib, but also for sm (I would like to stop on the first
"malloc()" that fails).
We have a two-level system (bitmask + return code) we could use to handle
non-severe errors (bitmask) and severe errors (return code). Currently, we
just use the return code as a way to ignore the bitmask, but we could use
the return code as a more serious message and thus improve our error
handling.
To sum up, my proposal is to change the meaning of an error return code
in add_procs() from "I got a problem, continue without me" which can be
perfectly handled with the bitmask alone, to "I got a fatal error, please
stop the application".
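To make the difference concrete, here is a minimal sketch of the two policies. It is not the actual BML code: the sketch_* names, the SKETCH_* codes, the 64-peer bitmask and both toy transports are assumptions made up for illustration. The "lenient" loop mirrors the behaviour described in this thread (an error return zeroes the bitmask and we move on to the next BTL); the "strict" loop is what I am proposing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; NOT the real Open MPI types or codes. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

typedef struct sketch_btl {
    const char *name;
    /* Sets bit i of *reachable if peer i is reachable, and returns a status code. */
    int (*add_procs)(struct sketch_btl *btl, size_t nprocs, uint64_t *reachable);
} sketch_btl_t;

/* Toy transport that reaches every peer. */
static int working_add_procs(sketch_btl_t *btl, size_t nprocs, uint64_t *reachable)
{
    (void)btl;
    assert(nprocs <= 64); /* the sketch uses a single 64-bit mask */
    *reachable = (nprocs == 64) ? UINT64_MAX : ((UINT64_C(1) << nprocs) - 1);
    return SKETCH_SUCCESS;
}

/* Toy transport that hits a serious error after partially setting itself up. */
static int failing_add_procs(sketch_btl_t *btl, size_t nprocs, uint64_t *reachable)
{
    (void)btl;
    (void)nprocs;
    *reachable = 1; /* bits may already be set when the error occurs */
    return SKETCH_ERROR;
}

/* Behaviour described in the thread: on error, drop the bitmask and try the next BTL. */
static int add_procs_lenient(sketch_btl_t **btls, size_t nbtls,
                             size_t nprocs, uint64_t *covered)
{
    *covered = 0;
    for (size_t i = 0; i < nbtls; i++) {
        uint64_t reachable = 0;
        if (btls[i]->add_procs(btls[i], nprocs, &reachable) != SKETCH_SUCCESS) {
            continue; /* "I got a problem, continue without me" */
        }
        *covered |= reachable;
    }
    return SKETCH_SUCCESS;
}

/* Proposed behaviour: the bitmask alone says "skip me"; an error return is fatal. */
static int add_procs_strict(sketch_btl_t **btls, size_t nbtls,
                            size_t nprocs, uint64_t *covered)
{
    *covered = 0;
    for (size_t i = 0; i < nbtls; i++) {
        uint64_t reachable = 0;
        int rc = btls[i]->add_procs(btls[i], nprocs, &reachable);
        if (rc != SKETCH_SUCCESS) {
            return rc; /* "I got a fatal error, please stop the application" */
        }
        *covered |= reachable; /* an all-zero mask simply contributes nothing */
    }
    return SKETCH_SUCCESS;
}
```

With the strict variant, a BTL that merely cannot reach anybody returns success with an empty bitmask, and the caller reports the first serious error instead of carrying on toward a later crash.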
I know this can be seen as an attempt to prevent fixing a bug in openib by
changing the design of the BML, but in this case, I think changing the BML
design would improve the overall behavior.
On Fri, 28 May 2010, Jeff Squyres wrote:
> To that point, where exactly in the openib BTL init / query sequence is
> it returning an error for you, Sylvain? Is it just a matter of tidying
> something up properly before returning the error?
> On May 28, 2010, at 2:21 PM, George Bosilca wrote:
>> On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:
>>> On Fri, 28 May 2010, Jeff Squyres wrote:
>>>> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
>>>>> Understood, and I agreed that the bug should be fixed. Patches would be welcome. :-)
>>> I sent a patch on the bml layer in my first e-mail. We will apply it on our tree, but as always we're trying to send patches back to open-source (it was not my intent to start such a debate).
>> The only problem with your patch is that it solves something that is not supposed to happen. As a proof of concept I did return errors from the tcp and sm BTLs, and Open MPI gracefully deals with them. So what we're looking at is not a matter of aborting; it is a matter of the openib BTL doing something it is not supposed to do.
>> Going through the code, it looks like the bitmask doesn't matter: if an error is returned by a BTL, we zero the bitmask and continue to another BTL.
>> Example: the SM BTL returns OMPI_ERROR after creating all the internal structures.
>>>> mpirun -np 4 --host node01 --mca btl sm,self ./ring
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications. This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes. This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other. This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>> Process 1 ([[22047,1],3]) is on host: node01
>> Process 2 ([[22047,1],0]) is on host: node01
>> BTLs attempted: self sm
>> Your MPI job is now going to abort; sorry.
>> Now if I allow TCP on the node:
>>>> mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring
>> Process 0 sending 10 to 1, tag 201 (4 procs in ring)
>> Process 0 sent to 1
>> Process 3 exiting
>> Process 0 decremented num: 9
>> Process 0 decremented num: 8
>> Thus, Open MPI does the right thing when the BTLs are playing the game.
>>>> I should clarify rather than being flip:
>>>> 1. I agree: the bug should be fixed. Clearly, we should never crash.
>>>> 2. After the bug is fixed, there is clearly a choice: some people may want to use a different transport if a given BTL is unavailable. Others may want to abort. Once the bug is fixed, this seems like a pretty straightforward thing to add.
>>> If you use my patch, you still have no choice. Errors on BTLs lead to an immediate stop instead of trying to continue (and crash).
>>> If someone wants to go further on this, then that's great. If nobody does, I think you should take my patch. Maybe it's not the best solution, but it's still better than the current state.
>>> devel mailing list
> Jeff Squyres