Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IOF repair
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-09 20:11:15

I'd like to have a look at the diff between the two, but I can't do so
until tomorrow at the earliest.

On Jul 9, 2008, at 7:26 PM, Ralph Castain wrote:

> I have been investigating Ticket #1135 - stdin is read twice if rank=0
> shares the node with mpirun. Repairing this problem is going to be
> quite difficult due to the rather terrible spaghetti code in the IOF,
> and the fact that the IOF in the HNP actually rml.sends the IO to
> itself multiple times as it cycles through the spaghetti.
>
> Unfortunately, this problem -is- a regression from 1.2. Rather than
> spending weeks trying to fix it, I see two approaches we could pursue.
> First, I could repair the problem by essentially returning the IOF to
> its 1.2 state. This will have to be done by hand, as most of the
> differences are in function calls to utilities that have changed due
> to the removal of the old NS framework. However, there are a few places
> where the logic itself has been modified, and the problem must stem
> from somewhere in there.
>
> If I make this change, then we will be no better, and no worse, than
> 1.2. Note that we currently advise people to read from a file instead
> of from stdin to avoid other issues that were present in 1.2.
>
> Alternatively, we could ship 1.3 as-is and warn users (as we did with
> 1.2) that they should avoid reading from stdin if there is any chance
> that rank=0 could be co-located with mpirun. Note that most of our
> clusters do not allow such co-location, but OMPI permits it by default.
>
> We already plan to revisit the IOF at next week's technical meeting,
> with the goal of redefining the IOF's API as a reduced set that
> reflects less ambitious requirements. I expect to implement those
> changes fairly soon thereafter, but they would be targeted to 1.4,
> not 1.3.
>
> Any thoughts on which way we should go?
> Ralph
> _______________________________________________
> devel mailing list
> devel_at_[hidden]

Jeff Squyres
Cisco Systems