Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] IOF repair
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-07-10 10:42:51


Can't argue with that...when Jeff gets back from his meeting he forgot
about, we'll chat and see what makes sense to recommend. The current code is
"worse" in the sense that we have this new bad behavior on stdin. It is
"better" in that Rolf and Jeff -did- plug a hole or two from the 1.2 days.

We'll chat about which is worse and get back to the list on it...

On 7/10/08 8:04 AM, "Terry Dontje" <Terry.Dontje_at_[hidden]> wrote:

> This all seems like a six-of-one, half-a-dozen-of-the-other decision. Both
> solutions suck because there are holes. So, it comes down to whether we
> think the current code is worse than 1.2 or not. If they are the same
> I'd be inclined to stay with what we have now for fear of inadvertently
> borking something else by rolling back the iof.
>
> --td
>
> Ralph Castain wrote:
>> I believe the changes were pretty much all related to an attempt to fix the
>> iof_flush problem and to correct a different problem affecting the reading
>> of stdin. Unfortunately, the iof_flush problem still remains, albeit
>> perhaps in a different form, and we now have a new problem in the stdin
>> behavior.
>>
>> It's the old problem of hastily written spaghetti code to overly-ambitious
>> requirements, subsequently hacked by multiple people multiple times...so now
>> when you put your finger in one hole, a leak springs up somewhere else. :-)
>>
>> Jeff and I are looking at this in more detail and will get back on it later.
>> The changes between 1.2 and 1.3 are not that big in terms of lines of code.
>> It just looks like they are suffering from the kid-and-the-dyke problem.
>>
>> My motivation in proposing the rollback was simply that any attempt to
>> repair this new hole will quite likely open another one somewhere else. So
>> even if we can "fix" the duplicate stdin problem...did the kid really
>> improve the situation?
>>
>>
>>
>> On 7/10/08 7:05 AM, "Terry Dontje" <Terry.Dontje_at_[hidden]> wrote:
>>
>>
>>> I see that Jeff has updated the ticket saying that he is looking at the
>>> code to see if he can generate a fix so the below may be superfluous.
>>>
>>> Anyways, what were the issues fixed in 1.3? It really comes down to how
>>> much more pain we are giving our users by rolling back to 1.2 or not.
>>>
>>> Note, I am assuming your comment of "most of our clusters do not..." is
>>> referring to LANL's clusters. I do not believe this statement is
>>> correct when you look at the OMPI community as a whole.
>>>
>>> --td
>>>
>>> Ralph Castain wrote:
>>>
>>>> I have been investigating Ticket #1135 - stdin is read twice if rank=0
>>>> shares the node with mpirun. Repairing this problem is going to be quite
>>>> difficult due to the rather terrible spaghetti code in the IOF, and the
>>>> fact that the IOF in the HNP actually rml.sends the IO to itself multiple
>>>> times as it cycles through the spaghetti.
>>>>
>>>> Unfortunately, this problem -is- a regression from 1.2. Rather than
>>>> spending weeks trying to fix it, I see two approaches we could pursue.
>>>> First, I could repair the problem by essentially returning the IOF to its
>>>> 1.2 state. This will have to be done by hand as most of the differences
>>>> are in function calls to utilities that have changed due to the removal of
>>>> the old NS framework. However, there are a few places where the logic
>>>> itself has been modified - and the problem must stem from somewhere in
>>>> there.
>>>>
>>>> If I make this change, then we will be no better, and no worse, than 1.2.
>>>> Note that we currently advise people to read from a file instead of from
>>>> stdin to avoid other issues that were present in 1.2.
>>>>
>>>> Alternatively, we could ship 1.3 as-is, and warn users (similar to 1.2)
>>>> that they should avoid reading from stdin if there is any chance that
>>>> rank=0 could be co-located with mpirun. Note that most of our clusters do
>>>> not allow such co-location - but it is permitted by default by OMPI.
>>>>
>>>> We already plan to revisit the IOF at next week's technical meeting, with a
>>>> goal of redefining the IOF's API to a more reduced set that reflects a less
>>>> ambitious requirement. I expect to implement those changes fairly soon
>>>> thereafter, but that would be targeted to 1.4 - not 1.3.
>>>>
>>>> Any thoughts on which way we should go?
>>>> Ralph
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>>
>>>
>>
>>
>>
>