Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Trunk returned to normal operations
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-02-28 09:36:36

On 2/28/08 7:27 AM, "Aurélien Bouteiller" <bouteill_at_[hidden]> wrote:

> Hi Ralph,
> Camille and I are also working on improving the restart capability
> of orte2. We are focusing on restarting individual processes (while
> Josh needs to restart the entire job). However, I guess most of the
> functionality is similar. Could we join your discussions on point 3?

Certainly! The discussion is about what to do with orte/runtime/orte_cr.c.
That code basically does a reset of the ORTE system and then works through
an abbreviated form of orte_init. In doing this, it used the old
orte_sds.set_my_name API, which is no longer supported, and had a lot of
duplicated code that used to be in orte_init.

However, the ESS now does the rte_init so that it can be tailored to the
local environment. If we move orte_cr into a new orte_ess.restart (or
whatever) API, then we could also allow each environment to tailor the
restart procedure to whatever it specifically needs.

Alternatively, we could just restore a "set_my_name" API to the ESS.

My personal preference is the first option as I feel it gives us the most
flexibility, and can extend C/R to other environments. But I leave that for
those working in C/R to decide.
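
To make the first option concrete, here is a minimal sketch, not the actual ORTE code. Every name in it (sketch_name_t, sketch_ess_module_t, sketch_ess_restart, the env_a/env_b modules, and the return codes) is hypothetical and invented for illustration; the real ESS module layout is not shown in this thread. The idea is only that each environment's module carries its own restart entry point next to its init, and an environment without C/R support leaves it unset.

```c
#include <stddef.h>

/* Hypothetical return codes, not the real ORTE constants. */
#define SKETCH_SUCCESS        0
#define SKETCH_ERR_NOT_AVAIL (-1)

/* Hypothetical process name (job id + rank within the job). */
typedef struct {
    int jobid;
    int vpid;
} sketch_name_t;

/* Hypothetical per-environment module: restart sits next to init so
 * each environment can tailor both. restart == NULL means the
 * environment does not support checkpoint/restart. */
typedef struct {
    int (*init)(sketch_name_t *me);
    int (*restart)(sketch_name_t *me);
} sketch_ess_module_t;

/* Environment A: init determines the process name (values are made up). */
int env_a_init(sketch_name_t *me)
{
    me->jobid = 1;
    me->vpid = 3;
    return SKETCH_SUCCESS;
}

/* Environment A: restart resets the stale state, then reuses init
 * instead of duplicating its code. */
int env_a_restart(sketch_name_t *me)
{
    me->jobid = -1;
    me->vpid = -1;
    return env_a_init(me);
}

/* Environment B: init only, no restart support. */
int env_b_init(sketch_name_t *me)
{
    me->jobid = 2;
    me->vpid = 0;
    return SKETCH_SUCCESS;
}

const sketch_ess_module_t env_a_module = { env_a_init, env_a_restart };
const sketch_ess_module_t env_b_module = { env_b_init, NULL };

/* Generic entry point the C/R code would call: dispatch to whatever
 * the selected environment provides. */
int sketch_ess_restart(const sketch_ess_module_t *module, sketch_name_t *me)
{
    if (module == NULL || module->restart == NULL) {
        return SKETCH_ERR_NOT_AVAIL;
    }
    return module->restart(me);
}
```

With this shape, the caller never needs a "set_my_name"-style API: it calls sketch_ess_restart() on the selected module and lets the environment re-establish the name itself. The second option would instead add a name-setting function to the module and keep the restart logic outside it.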


> Aurelien
> On 27 Feb 08, at 21:47, Ralph Castain wrote:
>> Hi folks
>> Okay, the ORTE merge appears to have gone well and is now complete -
>> you are free to use the trunk.
>> A few caveats:
>> 1. obviously, you will need to autogen/configure once you update. I
>> -strongly- recommend you rm -rf your install directory first as you
>> will definitely be hit with stale libraries from this commit
>> 2. this is a "drop" from the ORTE devel effort. As such, it is -not-
>> complete. There are several known issues, particularly with comm_spawn
>> and singleton comm_spawn in certain environments and scenarios. I have
>> a "fix" already done and ready to be applied for the comm_spawn
>> problems, but I want to test it some more in the morning before
>> committing it to the trunk - and I didn't want to delay this merge
>> any longer.
>> 3. we know that checkpoint/restart is currently broken. Josh and I
>> have discussed a couple of options for repairing it, and he will look
>> at it as soon as he has a chance. It isn't a big problem - just need
>> to decide which option he would prefer to pursue.
>> The remaining ORTE scalability work should be moving into the trunk
>> over the next few weeks (I will be on vacation 3/7-14, so it will
>> likely take through March). We do not anticipate any API changes or
>> framework adds/deletes the rest of the way - there will be a few new
>> components added to existing frameworks, some revamp of the logic in
>> a few places, etc.
>> I will try to cover all the changes in one or two notes over the next
>> few days to avoid carpal tunnel. Please feel free to ask questions and
>> I'll do my best to provide answers.
>> Thanks again for the cooperation tonight...
>> Ralph
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]