
Subject: Re: [OMPI users] PBS tm error returns
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-17 13:48:15


Hi David

You are quite correct. IIRC, we didn't bother checking local_err because
we found it to be unreliable - all Torque checks is that the program
successfully exec'd. It doesn't report back an error if the daemon
segfaults immediately, for example, or aborts because it fails to find a
required library. So we added a simple timer that declares the launch a
failure if the daemon(s) fail to report back within a specified time.
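
FWIW, the gist of that timer approach is roughly the sketch below. This is
not the actual ORTE code - the function and variable names, the callback
shape, and the 60-second value are all made up for illustration:

/* Sketch only: declare the launch failed if the spawned daemons do not
 * report back within a deadline, instead of relying on tm_poll()'s
 * return code.  All names and values here are illustrative. */
#include <stdio.h>
#include <time.h>

#define LAUNCH_TIMEOUT_SECS 60        /* assumed; would be a tunable param */

static int daemons_expected = 0;      /* set to the number of daemons spawned */
static int daemons_reported = 0;      /* incremented as each daemon calls back */

/* Polled (or fired from a timer event) after the tm_spawn() calls return. */
static int check_daemon_launch(time_t launch_start)
{
    if (daemons_reported >= daemons_expected) {
        return 0;    /* everyone reported in - launch succeeded */
    }
    if (time(NULL) - launch_start > LAUNCH_TIMEOUT_SECS) {
        fprintf(stderr, "daemons failed to report within %d seconds - "
                "declaring the launch failed\n", LAUNCH_TIMEOUT_SECS);
        return -1;   /* abort the launch */
    }
    return 1;        /* still waiting */
}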

This didn't cause any problems, so I went ahead and put it in our devel
trunk. Barring any subsequent error reports, I'll move it over to the 1.3
series.

Thanks!
Ralph

> However, it can't hurt to check the flag as well. I'll test it out first
> just to ensure we don't get false failures.
>
> Thanks
> Ralph
>
> On Aug 12, 2009, at 11:33 PM, David Singleton wrote:
>>
>>
>> Maybe this should go to the devel list but I'll start here.
>>
>> In tracking the way the PBS tm API propagates error information
>> back to clients, I noticed that Open MPI is making an incorrect
>> assumption. (I'm looking at 1.3.2.) The relevant code in
>> orte/mca/plm/tm/plm_tm_module.c is:
>>
>> /* TM poll for all the spawns */
>> for (i = 0; i < launched; ++i) {
>>     rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>     if (TM_SUCCESS != rc) {
>>         errno = local_err;
>>         opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                     " return status = %d", rc);
>>         goto cleanup;
>>     }
>> }
>>
>> My reading of the way the tm API works is that tm_poll() can (will)
>> return TM_SUCCESS(0) even when the tm_spawn event being waited on failed,
>> i.e. local_err needs to be checked even if rc=0. It looks like TM_
>> errors (rc values) are from tm protocol failures or incorrect calls
>> to tm. local_err is to do with why the actual requested action failed
>> and is usually some sort of internal PBSE_ error code. In fact it's
>> probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().
>>
>> Something like the following is probably closer to what is needed.
>>
>> /* TM poll for all the spawns */
>> for (i = 0; i < launched; ++i) {
>>     rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>     if (TM_SUCCESS != rc) {
>>         errno = local_err;
>>         opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                     " return status = %d", rc);
>>         goto cleanup;
>>     }
>>     if (local_err != 0) {
>>         errno = local_err;
>>         opal_output(0, "plm:tm: failed to spawn daemon,"
>>                     " error code = %d", errno);
>>         goto cleanup;
>>     }
>> }
>>
>> I checked torque 2.3.3 to confirm that its tm behaviour is the same as
>> OpenPBS in this respect. No idea about PBSPro.
>>
>>
>> David
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>