Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] PBS tm error returns
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-13 07:42:52


Hi David

You are quite correct. IIRC, we didn't bother checking the local_err
because we found it to be unreliable - all Torque checks is that the
program exec's. It doesn't report back an error if it segfaults
instantly, for example, or aborts because it fails to find a required
library. So we added a simple timer that declares the launch a failure
if the daemon(s) fail to report back in a specified time.

However, it can't hurt to check the flag as well. I'll test it out
first just to ensure we don't get false failures.
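
As a rough illustration of how the two checks could combine (a minimal sketch only, not the actual ORTE code): the type daemon_status_t and the functions count_failed_daemons() and launch_ok() are hypothetical names made up for this example, and the report times are passed in as plain numbers rather than measured by a real timer. A daemon counts as failed if the launcher flagged an error for it, or if it never called home within the timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-daemon record, invented for this sketch
 * (not an Open MPI or Torque type). */
typedef struct {
    int    spawn_err;      /* launcher's error flag; 0 means the exec succeeded */
    double report_time_s;  /* seconds after launch when the daemon reported back;
                              a negative value means it never reported */
} daemon_status_t;

/* Count daemons treated as failed: either the launcher flagged an error,
 * or the daemon did not report back within timeout_s seconds. */
int count_failed_daemons(const daemon_status_t *d, size_t n, double timeout_s)
{
    assert(d != NULL || n == 0);

    int failed = 0;
    for (size_t i = 0; i < n; ++i) {
        int reported = d[i].report_time_s >= 0.0 &&
                       d[i].report_time_s <= timeout_s;
        if (d[i].spawn_err != 0 || !reported) {
            ++failed;
        }
    }
    return failed;
}

/* The launch is declared good only if every daemon passed both checks. */
int launch_ok(const daemon_status_t *d, size_t n, double timeout_s)
{
    return count_failed_daemons(d, n, timeout_s) == 0;
}
```

A daemon that exec's cleanly but segfaults straight away would have spawn_err == 0 and a negative report_time_s, so only the timeout check catches it; a daemon whose spawn was flagged is caught without waiting for the timer.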

Thanks
Ralph

On Aug 12, 2009, at 11:33 PM, David Singleton wrote:

>
> Maybe this should go to the devel list but I'll start here.
>
> In tracking the way the PBS tm API propagates error information
> back to clients, I noticed that Open MPI is making an incorrect
> assumption. (I'm looking at 1.3.2.) The relevant code in
> orte/mca/plm/tm/plm_tm_module.c is:
>
>     /* TM poll for all the spawns */
>     for (i = 0; i < launched; ++i) {
>         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>         if (TM_SUCCESS != rc) {
>             errno = local_err;
>             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>                         " return status = %d", rc);
>             goto cleanup;
>         }
>     }
>
> My reading of the way the tm API works is that tm_poll() can (will)
> return TM_SUCCESS (0) even when the tm_spawn event being waited on
> failed, i.e. local_err needs to be checked even if rc=0. It looks like
> TM_ errors (rc values) come from tm protocol failures or incorrect
> calls to tm. local_err says why the actual requested action failed
> and is usually some sort of internal PBSE_ error code. In fact it's
> probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().
>
> Something like the following is probably closer to what is needed.
>
>     /* TM poll for all the spawns */
>     for (i = 0; i < launched; ++i) {
>         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>         if (TM_SUCCESS != rc) {
>             errno = local_err;
>             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>                         " return status = %d", rc);
>             goto cleanup;
>         }
>         if (local_err != 0) {
>             errno = local_err;
>             opal_output(0, "plm:tm: failed to spawn daemon,"
>                         " error code = %d", errno);
>             goto cleanup;
>         }
>     }
>
> I checked Torque 2.3.3 to confirm that its tm behaviour is the same
> as OpenPBS in this respect. No idea about PBSPro.
>
>
> David