Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

From: Greg Watson (gwatson_at_[hidden])
Date: 2006-04-20 13:25:19


Ok, thanks.

For clarification, the model we're using at the moment looks roughly
like this:

orte_init();

forever () {
        if (do_our_stuff() == GAME_OVER)
                break;
        opal_event_loop(OPAL_EVLOOP_ONCE);
}

orte_finalize();

The simplest change for us would be something like:

orte_init();

forever () {
        if (do_our_stuff() == GAME_OVER)
                break;
        if (opal_event_loop(OPAL_EVLOOP_ONCE) != ORTE_SUCCESS) {
                clean_up_our_stuff();
                break;
        }
}

orte_finalize();

Greg

On Apr 20, 2006, at 10:21 AM, Ralph Castain wrote:

> You make a good point about the library not calling exit(). I'll
> have to recruit some help to look at the notion of opal_even_loop
> returning an error value - it isn't entirely clear who it would
> return it to in our system,. Even though I understand how someone
> in your situation would handle it, I have to ensure that it doesn't
> cause the base system problems, or force a major code revision that
> would need to be scheduled into the project.
>
> We'll have to get back to you on this - most of the folks are at a
> workshop this week, so it will probably be next week before we can
> discuss it.
>
> Ralph
>
>
> Greg Watson wrote:
>> The simplest thing for us would be for opal_event_loop() to return
>> an error value. That way we can detect the situation and clean up
>> our system. At the moment we're not trying to restart orted, so
>> clean recovery of orte is not that important, though ultimately I
>> would think it is desirable. Other alternatives are to pass you an
>> error handler that you call, or you could send a signal that we
>> can trap.
>>
>> >From our perspective, we're simply calling a library that does
>> stuff. Having the library call exit() at any point is a major
>> problem for applications trying to do more than run a single job.
>>
>> Greg
>>
>> On Apr 20, 2006, at 9:40 AM, Ralph Castain wrote:
>>
>>> Well, I actually don't know much about opal_event_loop and/or how
>>> it is intended to work. My guess is that:
>>>
>>> (a) your remote orted is acting as the seed and your local
>>> process (the one in Eclipse) is running as a client to that seed
>>> - at least, that was the case last I talked to Nathan
>>>
>>> (b) when the seed orted dies, it is the oob in your local client
>>> that actually detects socket closure and decides that - since it
>>> is the seed that has lost contact - the local application must
>>> abort.
>>>
>>> (c) the errmgr.abort function does exactly what it was supposed
>>> to do - it provides an immediate way of killing the local process.
>>>
>>> I'd be a little hesitant to recommend overloading the
>>> errmgr.abort function as you really do want the local processes
>>> to die when losing connection to the seed (at least, until we
>>> develop a recovery capability for the seed orted - which is some
>>> ways off), and (given the way you are running) I'm not sure you
>>> can have a different errmgr for your process while leaving the
>>> other one for everyone else.
>>>
>>> Probably the best solution for now would be for us to insert a
>>> (yet another) MCA parameter into the errmgr that would (if set)
>>> have errmgr.abort do something other than exit. The question then
>>> is: what would you want it to do?? We need to have it tell the
>>> rest of the system to stop trying to send messages etc - right
>>> now, I don't think the infrastructure exists to do that short of
>>> killing orte.
>>>
>>> We could try to have errmgr.abort do an orte_finalize - that
>>> would kill the orte system without impacting your host program, I
>>> suspect. You would then have to re-initialize, so we'd have to
>>> find some way to let you know that we had finalized. I can't
>>> swear this will work, though - we might well generate a segfault
>>> since this is happening deep down inside the system. We could try
>>> it, though.
>>>
>>> Would any of that be of help? Do you have any suggestions on how
>>> we might let you know that we had finalized?
>>>
>>> Ralph
>>>
>>>
>>> Brian Barrett wrote:
>>>> On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:
>>>>> We've just run across a rather tricky issue. We're calling
>>>>> opal_event_loop() to dispatch orte events to an orted that has
>>>>> been launched separately. However if the orted dies for some
>>>>> reason (gets a signal or whatever) then opal_event_loop() is
>>>>> calling exit(). Needless to say, this is not good behavior us.
>>>>> Any suggestions on how to get around this problem?
>>>> Is the orted you are connecting to the "seed" daemon? I think
>>>> the only time we should be exiting like that is if the orted was
>>>> the seed daemon. I'm not sure what we want to do if that's the
>>>> case -- it looks like we're calling errmgr.abort() when badness
>>>> happens. I wonder if your application can provide its own errmgr
>>>> component that provides an abort that doesn't actually abort?
>>>> Just some off the cuff ideas -- Ralph could probably give a
>>>> better idea of exactly what is happening... Brian
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel