Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] orte\mca\smr
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-03-17 12:52:48


Hello

As Jeff stated, the smr has been removed from the system. We did this
because experience showed that monitoring process/node status was highly
system dependent and directly correlated with the launch system. Thus, it
made no sense to separate those two functions.

For example, we have successfully prototyped the detection of orted/node
failure on TM based on notification from Torque when the orted fails. A
similar approach appears to be working under SLURM (one glitch remains to be
ironed out).

I would think that a heartbeat protocol would primarily have applicability
in the RSH environment. We certainly wouldn't want to do it in TM or SLURM,
and I suspect that most of the other managed environments have similar
detection mechanisms.

If you think there are other environments that also would need a heartbeat,
then you could put it in the PLM base and people can call it if they want to
use it. My only caveat there is that it increases our binary size since base
functions are always compiled, so we would only want to do that if we really
thought multiple environments would use it. If it is only RSH, then it would
probably be better placed in the RSH PLM module.
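
To make the heartbeat idea concrete, here is a minimal sketch in C of the bookkeeping such a watchdog might do: record the last time each daemon was heard from, and report any daemon whose silence exceeds a timeout. This is only an illustration under stated assumptions, not ORTE code. The names hb_monitor_t, hb_init, hb_record_beat, and hb_check are hypothetical, peers are identified by a plain integer index, and the transport that actually delivers the beats (for example over the RSH-launched daemons' connections) is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

#define HB_MAX_PEERS 64   /* hypothetical fixed upper bound on monitored daemons */

/* Hypothetical watchdog state; not an ORTE data structure. */
typedef struct {
    time_t last_seen[HB_MAX_PEERS]; /* time of the most recent beat per peer */
    bool   alive[HB_MAX_PEERS];     /* false once a peer has been declared failed */
    int    num_peers;
    double timeout_sec;             /* silence longer than this counts as failure */
} hb_monitor_t;

/* Start monitoring num_peers peers; every peer is treated as seen at 'now'. */
static void hb_init(hb_monitor_t *m, int num_peers, double timeout_sec, time_t now)
{
    assert(num_peers >= 0 && num_peers <= HB_MAX_PEERS);
    m->num_peers = num_peers;
    m->timeout_sec = timeout_sec;
    for (int i = 0; i < num_peers; i++) {
        m->last_seen[i] = now;
        m->alive[i] = true;
    }
}

/* Record a beat from 'peer'. Out-of-range ids are ignored, and a late beat
 * from a peer already declared failed does not revive it. */
static void hb_record_beat(hb_monitor_t *m, int peer, time_t now)
{
    if (peer < 0 || peer >= m->num_peers || !m->alive[peer]) {
        return;
    }
    m->last_seen[peer] = now;
}

/* Mark every live peer whose silence exceeds the timeout as failed.
 * Returns the number of newly failed peers; the first max_failed of their
 * ids are written to 'failed'. Each failure is reported only once. */
static int hb_check(hb_monitor_t *m, time_t now, int *failed, int max_failed)
{
    int count = 0;
    for (int i = 0; i < m->num_peers; i++) {
        if (m->alive[i] && difftime(now, m->last_seen[i]) > m->timeout_sec) {
            m->alive[i] = false;
            if (count < max_failed) {
                failed[count] = i;
            }
            count++;
        }
    }
    return count;
}
```

A caller would invoke hb_record_beat whenever a beat arrives and hb_check from a periodic timer. Keeping the state in one small struct means the same logic could sit either in a shared base or inside a single launcher module, which is the placement trade-off discussed above.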

Hope that helps
Ralph

On 3/10/08 9:16 AM, "Leonardo Fialho" <lfialho_at_[hidden]> wrote:

> Hi Jeff,
>
> I need to implement a heartbeat/watchdog monitoring system. I'm looking
> for the "best place" to put it, and I don't want to duplicate code.
> I'll try putting it into PLM for now, and once I get Ralph's response I
> will change it if necessary.
>
> Jeff Squyres wrote:
>> Yes, it all got consolidated down into plm. We need to update the
>> FAQ; the ORTE frameworks changed quite a bit in the recent ORTE merge...
>>
>> Ralph's on vacation this week. A detailed answer to your question may
>> not occur until he returns...
>>
>>
>> On Mar 10, 2008, at 10:05 AM, Leonardo Fialho wrote:
>>
>>
>>> Hi all,
>>>
>>> Where is the "old" orte\mca\smr? I can't find it in orte/mca/plm...
>>>
>>> --
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>>
>