Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IOF repair
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2008-07-10 09:05:34

I see that Jeff has updated the ticket saying that he is looking at the
code to see if he can generate a fix so the below may be superfluous.

Anyway, what were the issues fixed in 1.3? It really comes down to how
much more pain we are giving our users by rolling back to 1.2 or not.

Note, I am assuming your comment of "most of our clusters do not..." is
referring to LANL's clusters. I do not believe this statement is
correct when you look at the OMPI community as a whole.


Ralph Castain wrote:
> I have been investigating Ticket #1135 - stdin is read twice if rank=0
> shares the node with mpirun. Repairing this problem is going to be quite
> difficult due to the rather terrible spaghetti code in the IOF, and the fact
> that the IOF in the HNP actually rml.sends the IO to itself multiple times
> as it cycles through the spaghetti.
> Unfortunately, this problem -is- a regression from 1.2. Rather than spending
> weeks trying to fix it, I see two approaches we could pursue. First, I could
> repair the problem by essentially returning the IOF to its 1.2 state. This
> will have to be done by hand as most of the differences are in function
> calls to utilities that have changed due to the removal of the old NS
> framework. However, there are a few places where the logic itself has been
> modified - and the problem must stem from somewhere in there.
> If I make this change, then we will be no better, and no worse, than 1.2.
> Note that we currently advise people to read from a file instead of from
> stdin to avoid other issues that were present in 1.2.
> Alternatively, we could ship 1.3 as-is, and warn users (similar to 1.2) that
> they should avoid reading from stdin if there is any chance that rank=0
> could be co-located with mpirun. Note that most of our clusters do not allow
> such co-location - but it is permitted by default by OMPI.
> We already plan to revisit the IOF at next week's technical meeting, with a
> goal of redefining the IOF's API to a more reduced set that reflects a less
> ambitious requirement. I expect to implement those changes fairly soon
> thereafter, but that would be targeted to 1.4 - not 1.3.
> Any thoughts on which way we should go?
> Ralph