On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
> Just curious - your proposed fix sounds exactly like what was done in the OPAL SOS work. Are you therefore proposing to use SOS to provide a more informational status return?
No, I think Sylvain's talking about slightly modifying the existing mechanism:
1. Return OMPI_SUCCESS: bml then obeys whatever is in the connectivity bitmask -- even if the bitmask indicates that this BTL can't talk to anyone.
2. Return != OMPI_SUCCESS: treat the problem as a fatal error.
I think Sylvain's point is that OMPI_SUCCESS can be returned for non-fatal errors if a BTL just wants to be ignored. In such cases, the BTL can just set its connectivity mask to 0. This will allow OMPI to continue the job.
For example, if verbs is borked (e.g., can't create CQs), the openib BTL can return a connectivity mask of 0 and OMPI_SUCCESS. The BML is then free to fail over to some other BTL.
But if a malloc() fails down in some BTL, then the job is hosed anyway -- so why not return != OMPI_SUCCESS and try to abort cleanly?
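To make the two-way convention concrete, here is a minimal sketch in C of how a caller could interpret a BTL's init result. It is not the actual R2 code: the names SKETCH_SUCCESS, SKETCH_ERROR, bml_action_t, and bml_handle_btl are made up for illustration, and the connectivity mask is assumed to fit in one 64-bit word (one bit per peer).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real OMPI definitions. */
#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)

typedef enum {
    BTL_USE,     /* init succeeded and at least one peer is reachable */
    BTL_IGNORE,  /* init succeeded but the mask is empty: skip this BTL */
    JOB_ABORT    /* init returned an error: treat as fatal, abort cleanly */
} bml_action_t;

/* Assumed layout: bit i of reachable_mask is set if peer i is reachable
 * through this BTL. */
static bml_action_t bml_handle_btl(int rc, uint64_t reachable_mask)
{
    if (rc != SKETCH_SUCCESS) {
        /* Case 2: any non-success return is fatal to the job. */
        return JOB_ABORT;
    }
    /* Case 1: obey the mask, even if it says "nobody is reachable". */
    return (reachable_mask != 0) ? BTL_USE : BTL_IGNORE;
}
```

With this shape, a BTL that wants to be ignored only has to clear its mask and return success; it never needs a special "non-fatal error" code.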
For sites that want to treat verbs failures as fatal, we can add a new MCA param either in the openib BTL that says "treat all init failures as fatal to the job" or perhaps a new MCA param in R2 that says "if the connectivity map for BTL <list> is empty, abort the job". Or something like that.
> If so, then it would seem the only real dispute here is: is there -any- condition whereby a given BTL should have the authority to tell OMPI to terminate an application, even if other BTLs could still function?
I think his cited example was if malloc() fails.
I could see some sites wanting to abort if their high-speed network was down (e.g., MX or openib BTLs failed to init) -- they wouldn't want OMPI to fail over to TCP. The flip side of this argument is that the sysadmin could set "btl = ^tcp" in the system file, and then if openib/mx fails, the BML will abort because some peers won't be reachable.
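The "btl = ^tcp" trick works because the abort comes from a reachability check, not from the failing BTL itself. A rough sketch of that check, assuming one 64-bit connectivity mask per BTL and at most 64 peers; btl_sketch_t and all_peers_reachable are hypothetical names, not the real BML interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-BTL record for illustration only. */
typedef struct {
    const char *name;
    bool        excluded;   /* e.g., the component was listed after '^' */
    uint64_t    reachable;  /* connectivity mask: bit i set = peer i reachable */
} btl_sketch_t;

/* Returns true if each of the first n_peers peers is reachable through at
 * least one BTL that has not been excluded. */
static bool all_peers_reachable(const btl_sketch_t *btls, size_t n_btls,
                                unsigned n_peers)
{
    assert(n_peers <= 64);
    uint64_t want = (n_peers == 64) ? UINT64_MAX
                                    : ((UINT64_C(1) << n_peers) - 1);
    uint64_t have = 0;
    for (size_t i = 0; i < n_btls; i++) {
        if (!btls[i].excluded) {
            have |= btls[i].reachable;  /* union of the usable masks */
        }
    }
    return (have & want) == want;
}
```

If openib comes back with an empty mask and tcp is excluded, the union no longer covers every peer and the caller can abort; with tcp left in, the same openib failure just results in failover.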
> I understand that the current code may not yet support that operation, but I do believe that was the intent of the design. So only the case where -all- BTLs say "I can't do it" would result in termination. Rather than change that design, I believe the intent is to work towards completing that implementation. In the interim, it would seem most sensible to me that we add an MCA param that specifies the termination behavior (i.e., attempt to continue or terminate on first fatal BTL error).
I think that there are multiple different exit conditions from a BTL init:
1. BTL succeeded in initializing, and some peers are reachable
2. BTL succeeded in initializing, and no peers are reachable
3. BTL failed to initialize, but that failure is localized to the BTL (e.g., openib failed to create a CQ)
4. BTL failed to initialize, and the error is global in nature (e.g., malloc() fail)
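I think it might be a site-specific decision as to whether to abort the job for condition 3 or not. Today the default is not to abort in that case; a site that wants an abort gets one indirectly, by excluding the fallback BTL (i.e., setting btl=^tcp) so that the BML aborts when some peers become unreachable.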
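One way to picture the four conditions plus a site knob for condition 3 is the sketch below. Everything in it is hypothetical: the enum names, the decide() function, and the local_failures_fatal flag (standing in for whatever MCA param we would end up adding) are assumptions for illustration, not existing OMPI code.

```c
#include <assert.h>
#include <stdbool.h>

/* The four exit conditions listed above, numbered to match. */
typedef enum {
    BTL_INIT_OK_PEERS = 1,   /* 1. initialized, some peers reachable */
    BTL_INIT_OK_NO_PEERS,    /* 2. initialized, no peers reachable */
    BTL_INIT_FAIL_LOCAL,     /* 3. failed, but only this BTL is affected */
    BTL_INIT_FAIL_GLOBAL     /* 4. failed, error is global (e.g., malloc) */
} btl_init_outcome_t;

typedef enum { ACT_USE, ACT_SKIP, ACT_ABORT } site_action_t;

/* local_failures_fatal is a hypothetical site setting: when true,
 * condition 3 aborts the job instead of skipping the BTL. */
static site_action_t decide(btl_init_outcome_t outcome,
                            bool local_failures_fatal)
{
    switch (outcome) {
    case BTL_INIT_OK_PEERS:    return ACT_USE;
    case BTL_INIT_OK_NO_PEERS: return ACT_SKIP;
    case BTL_INIT_FAIL_LOCAL:  return local_failures_fatal ? ACT_ABORT
                                                           : ACT_SKIP;
    case BTL_INIT_FAIL_GLOBAL: return ACT_ABORT;
    }
    return ACT_ABORT;  /* unknown outcome: be conservative */
}
```

Only condition 3 depends on the site setting; the other three have the same answer everywhere, which is why a single param would be enough.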