On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:
> We've just run across a rather tricky issue. We're calling
> opal_event_loop() to dispatch orte events to an orted that has been
> launched separately. However if the orted dies for some reason (gets
> a signal or whatever) then opal_event_loop() is calling exit().
> Needless to say, this is not good behavior for us. Any suggestions on how
> to get around this problem?
Is the orted you are connecting to the "seed" daemon? I think the
only time we should be exiting like that is if the orted was the seed
daemon. I'm not sure what we want to do if that's the case -- it
looks like we're calling errmgr.abort() when badness happens. I
wonder if your application can provide its own errmgr component that
provides an abort that doesn't actually abort? Just some off the
cuff ideas -- Ralph could probably give a better idea of exactly what
is happening...
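
To make the "abort that doesn't actually abort" idea a bit more concrete, here is a rough sketch in plain C. It assumes nothing about the real errmgr interface: every name in it (demo_errmgr_t, demo_soft_abort, demo_on_daemon_lost) is made up for illustration, and the actual component structure will certainly look different. The only point is that the abort entry is a function pointer, so the application can supply one that records the failure and returns instead of calling exit().

```c
#include <stdio.h>

/* Hypothetical error-manager interface: a table of function pointers.
 * This is NOT the real ORTE errmgr API, only an illustration of the idea. */
typedef struct {
    void (*abort_fn)(int status);
} demo_errmgr_t;

/* State the application can inspect after its event loop returns. */
static int demo_last_status = 0;
static int demo_abort_count = 0;

/* Replacement "abort": record the failure and return to the caller
 * instead of terminating the process. */
static void demo_soft_abort(int status)
{
    demo_last_status = status;
    demo_abort_count++;
    fprintf(stderr, "daemon lost (status %d); continuing\n", status);
}

/* The application installs its own implementation in the table. */
static demo_errmgr_t demo_errmgr = { demo_soft_abort };

/* Hypothetical stand-in for the point in the event loop where the
 * loss of the daemon is detected and the error manager is invoked. */
static void demo_on_daemon_lost(int status)
{
    demo_errmgr.abort_fn(status);
}
```

With something like this in place, a call such as demo_on_daemon_lost(9) returns normally, and the application can check demo_abort_count afterwards to decide whether to reconnect or shut down cleanly.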
Brian
--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/