Open MPI User's Mailing List Archives

Subject: [OMPI users] PBS tm error returns
From: David Singleton (David.Singleton_at_[hidden])
Date: 2009-08-13 01:33:51


Maybe this should go to the devel list but I'll start here.

In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption. (I'm looking at 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:

     /* TM poll for all the spawns */
     for (i = 0; i < launched; ++i) {
         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
         if (TM_SUCCESS != rc) {
             errno = local_err;
             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                            " return status = %d", rc);
             goto cleanup;
         }
     }

My reading of the tm API is that tm_poll() can (and will) return
TM_SUCCESS (0) even when the tm_spawn event being waited on failed,
i.e. local_err needs to be checked even if rc == 0. It looks like the
TM_ errors (rc values) come from tm protocol failures or incorrect calls
to tm, whereas local_err says why the requested action itself failed
and is usually some sort of internal PBSE_ error code. In fact it's
probably always PBSE_SYSTEM (15010); I think it is for tm_spawn().

Something like the following is probably closer to what is needed.

     /* TM poll for all the spawns */
     for (i = 0; i < launched; ++i) {
         rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
         if (TM_SUCCESS != rc) {
             errno = local_err;
             opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                            " return status = %d", rc);
             goto cleanup;
         }
         if (local_err != 0) {
             errno = local_err;
             opal_output(0, "plm:tm: failed to spawn daemon,"
                            " error code = %d", errno);
             goto cleanup;
         }
     }
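
The two error channels described above can be exercised in isolation with a small standalone sketch. It does not link against tm or Open MPI: SKETCH_TM_SUCCESS, classify_poll() and first_failed_spawn() are hypothetical names made up for illustration, and the only assumption carried over from the message is that a zero rc with a non-zero local_err means the spawn itself failed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for TM_SUCCESS from tm.h (0, as noted in the message).
 * Hypothetical name, so the sketch builds without PBS headers. */
#define SKETCH_TM_SUCCESS 0

/* Hypothetical classification of one tm_poll() result. */
enum poll_outcome {
    POLL_OK,            /* poll worked and the spawn event succeeded */
    POLL_TM_FAILURE,    /* rc != TM_SUCCESS: protocol or usage error */
    POLL_SPAWN_FAILURE  /* rc == TM_SUCCESS but local_err != 0       */
};

/* rc says whether the poll itself worked; local_err says whether the
 * event that was polled for (here, a tm_spawn) actually succeeded. */
enum poll_outcome classify_poll(int rc, int local_err)
{
    if (SKETCH_TM_SUCCESS != rc) {
        return POLL_TM_FAILURE;
    }
    if (0 != local_err) {
        return POLL_SPAWN_FAILURE;
    }
    return POLL_OK;
}

/* Return the index of the first poll result that is not POLL_OK,
 * or -1 if every spawn succeeded. */
int first_failed_spawn(const int *rcs, const int *local_errs, int count)
{
    int i;

    assert(count >= 0);
    assert(0 == count || (NULL != rcs && NULL != local_errs));

    for (i = 0; i < count; ++i) {
        if (POLL_OK != classify_poll(rcs[i], local_errs[i])) {
            return i;
        }
    }
    return -1;
}
```

With this split, a result pair of rc = 0 and local_err = 15010 is reported as a spawn failure, which is the case the 1.3.2 loop lets through.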

I checked Torque 2.3.3 to confirm that its tm behaviour is the same as
OpenPBS in this respect. No idea about PBSPro.

David