Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Device failover on ob1
From: Brian W. Barrett (brbarret_at_[hidden])
Date: 2009-08-03 11:23:25

On Sun, 2 Aug 2009, Ralph Castain wrote:

> Perhaps a bigger question needs to be addressed - namely, does the ob1 code
> need to be refactored?
> Having been involved a little in the early discussion with bull when we
> debated over where to put this, I know the primary concern was that the code
> not suffer the same fate as the dr module. We have since run into a similar
> issue with the checksum module, so I know where they are coming from.
> The problem is that the code base is adjusted to support changes in ob1,
> which is still being debugged. On the order of 95% of the code in ob1 is
> required to be common across all the pml modules, so the rest of us have to
> (a) watch carefully all the commits to see if someone touches ob1, and then
> (b) manually mirror the change in our modules.
> This is not a supportable model over the long-term, which is why dr has died,
> and checksum is considering integrating into ob1 using configure #if's to
> avoid impacting non-checksum users. Likewise, device failover has been
> treated similarly here - i.e., configure out the added code unless someone
> wants it.
> This -does- lead to messier source code with these #if's in it. If we can
> refactor the ob1 code so the common functionality resides in the base, then
> perhaps we can avoid this problem.
> Is it possible?

I think Ralph raises a good point - we need to think about how to allow
better use of OB1's code base between consumers like checksum and
failover. The current situation is problematic to me, for the reasons
Ralph cited. However, since the ob1 structures and code have little use
for PMLs such as CM, I'd rather not push the code into the base - in the
end, it's very specific to a particular PML implementation and the code
pushed into the base already made things much more interesting in
implementing CM than I would have liked. DR is different in this
conversation, as it was almost entirely a separate implementation from ob1
by the end, due to the removal of many features and the addition of many

However, I think there's middle ground here which could greatly improve
the current situation. With the proper refactoring, there's no technical
reason why we couldn't move the checksum functionality into ob1 and add
the failover to ob1, with no impact on performance when the functionality
isn't used and little impact on code readability.
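
As a rough illustration of that kind of refactoring (a sketch only, with
made-up names; nothing below is an actual ob1 symbol or structure), the
optional functionality could sit behind a hook that is chosen once at init
time. When the feature is off, the send path pays only for a NULL check on
a pointer, and the source needs no #if blocks around the feature:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these are real ob1 symbols. */

typedef struct sketch_frag {
    const uint8_t *data;
    size_t         len;
    uint32_t       csum;   /* written only when checksumming is enabled */
} sketch_frag_t;

/* Optional per-fragment hook (checksum, failover bookkeeping, ...). */
typedef void (*sketch_hook_fn)(sketch_frag_t *frag);

typedef struct sketch_pml {
    sketch_hook_fn pre_send;   /* NULL when the feature is not in use */
    unsigned long  sent;       /* fragments handed to the send path */
} sketch_pml_t;

/* Toy additive checksum, standing in for whatever a real module computes. */
static void sketch_checksum_hook(sketch_frag_t *frag)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < frag->len; i++) {
        sum += frag->data[i];
    }
    frag->csum = sum;
}

/* The feature is selected once here, not on every send. */
static void sketch_pml_init(sketch_pml_t *pml, int enable_checksum)
{
    pml->pre_send = enable_checksum ? sketch_checksum_hook : NULL;
    pml->sent = 0;
}

/* Common send path: one pointer test when the feature is disabled. */
static void sketch_send(sketch_pml_t *pml, sketch_frag_t *frag)
{
    if (pml->pre_send != NULL) {
        pml->pre_send(frag);
    }
    pml->sent++;
}
```

Whether a runtime hook like this or a configure-time switch is the better
fit would depend on how hot the path is; the sketch only shows that the
optional code can live in one place without being mirrored across modules.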

So, in summary, refactor OB1 to support checksum / failover good, pushing
ob1 code into base bad.