Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] PBS tm error returns
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-17 13:48:15


Hi David

This didn't cause any problems, so I went ahead and put it in our devel
trunk. Barring any subsequent error reports, I'll move it over to the 1.3
series.

Thanks!
Ralph

> Hi David
>
> You are quite correct. IIRC, we didn't bother checking the local_err because
> we found it to be unreliable - all Torque checks is that the program exec's.
> It doesn't report back an error if it segfaults instantly, for example, or
> aborts because it fails to find a required library. So we added a simple
> timer that declares the launch a failure if the daemon(s) fail to report
> back in a specified time.
>
> However, it can't hurt to check the flag as well. I'll test it out first
> just to ensure we don't get false failures.
>
> Thanks
> Ralph
>
> On Aug 12, 2009, at 11:33 PM, David Singleton wrote:
>>
>>
>> Maybe this should go to the devel list but I'll start here.
>>
>> In tracking the way the PBS tm API propagates error information
>> back to clients, I noticed that Open MPI is making an incorrect
>> assumption. (I'm looking at 1.3.2.) The relevant code in
>> orte/mca/plm/tm/plm_tm_module.c is:
>>
>>     /* TM poll for all the spawns */
>>     for (i = 0; i < launched; ++i) {
>>         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>         if (TM_SUCCESS != rc) {
>>             errno = local_err;
>>             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                         " return status = %d", rc);
>>             goto cleanup;
>>         }
>>     }
>>
>> My reading of the way the tm API works is that tm_poll() can (will)
>> return TM_SUCCESS(0) even when the tm_spawn event being waited on failed,
>> i.e. local_err needs to be checked even if rc=0. It looks like TM_
>> errors (rc values) are from tm protocol failures or incorrect calls
>> to tm. local_err is to do with why the actual requested action failed
>> and is usually some sort of internal PBSE_ error code. In fact it's
>> probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().
>>
>> Something like the following is probably closer to what is needed.
>>
>>     /* TM poll for all the spawns */
>>     for (i = 0; i < launched; ++i) {
>>         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>         if (TM_SUCCESS != rc) {
>>             errno = local_err;
>>             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                         " return status = %d", rc);
>>             goto cleanup;
>>         }
>>         if (local_err != 0) {
>>             errno = local_err;
>>             opal_output(0, "plm:tm: failed to spawn daemon,"
>>                         " error code = %d", errno);
>>             goto cleanup;
>>         }
>>     }
>>
>> I checked torque 2.3.3 to confirm that its tm behaviour is the same as
>> OpenPBS in this respect. No idea about PBSPro.
>>
>>
>> David
>
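
For readers following the thread, here is a minimal, self-contained sketch of how the two checks discussed above could fit together: the per-event check of both rc and local_err that David proposes, followed by the report-back deadline that Ralph describes. It does not use the real TM API. mock_tm_poll, mock_event_t, launch_status_t, check_launch and MOCK_TM_SUCCESS are all hypothetical stand-ins invented for illustration, and the report_s field simply simulates how long a daemon takes to report back. The actual Open MPI code differs.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for TM_SUCCESS; the real value comes from Torque's tm.h. */
#define MOCK_TM_SUCCESS 0

/* One simulated spawn event (illustrative only, not a TM type). */
typedef struct {
    int rc;          /* protocol-level status returned by the poll call */
    int local_err;   /* why the requested action failed, 0 if it did not */
    double report_s; /* seconds until the daemon reports back; negative = never */
} mock_event_t;

typedef enum {
    LAUNCH_OK,
    LAUNCH_POLL_FAILED,  /* the poll call itself failed (rc != 0) */
    LAUNCH_SPAWN_FAILED, /* the poll succeeded but the spawn did not */
    LAUNCH_TIMED_OUT     /* the program exec'd but the daemon never reported */
} launch_status_t;

/* Mock poll: hands back the simulated rc and local_err for one event. */
static int mock_tm_poll(const mock_event_t *ev, int *local_err)
{
    *local_err = ev->local_err;
    return ev->rc;
}

launch_status_t check_launch(const mock_event_t *events, int launched,
                             double timeout_s)
{
    assert(events != NULL || launched == 0);

    /* Pass 1: check both the poll status and the per-event error flag. */
    for (int i = 0; i < launched; ++i) {
        int local_err = 0;
        int rc = mock_tm_poll(&events[i], &local_err);
        if (MOCK_TM_SUCCESS != rc) {
            fprintf(stderr, "failed to poll for daemon %d, return status = %d\n",
                    i, rc);
            return LAUNCH_POLL_FAILED;
        }
        if (0 != local_err) {
            fprintf(stderr, "failed to spawn daemon %d, error code = %d\n",
                    i, local_err);
            return LAUNCH_SPAWN_FAILED;
        }
    }

    /* Pass 2: a clean exec does not prove the daemon survived, so also
     * require every daemon to report back within the deadline. */
    for (int i = 0; i < launched; ++i) {
        if (events[i].report_s < 0.0 || events[i].report_s > timeout_s) {
            fprintf(stderr, "daemon %d did not report back within %.1f s\n",
                    i, timeout_s);
            return LAUNCH_TIMED_OUT;
        }
    }
    return LAUNCH_OK;
}
```

Calling check_launch() on an event with rc = 0 and local_err = 15010 (the PBSE_SYSTEM value quoted above) returns LAUNCH_SPAWN_FAILED, which is the case the original loop would have missed. An event with rc = 0, local_err = 0 and report_s = -1 returns LAUNCH_TIMED_OUT, which models a daemon that exec'd but died before reporting.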