Maybe this should go to the devel list but I'll start here.
In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption. (I'm looking at 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:
    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
    }
My reading of the tm API is that tm_poll() can (and will) return
TM_SUCCESS (0) even when the tm_spawn event being waited on has failed,
i.e. local_err needs to be checked even if rc is 0. The TM_ errors
(rc values) appear to come from tm protocol failures or incorrect calls
to tm, whereas local_err reports why the requested action itself failed
and is usually some sort of internal PBSE_ error code. In fact it is
probably always PBSE_SYSTEM (15010); I think it is for tm_spawn().
Something like the following is probably closer to what is needed.
    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                        " return status = %d", rc);
            goto cleanup;
        }
        if (0 != local_err) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to spawn daemon,"
                        " error code = %d", errno);
            goto cleanup;
        }
    }
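To make the two error channels explicit, the decision taken after each
tm_poll() call can be written as a small stand-alone helper. This is only a
sketch: classify_poll() and the spawn_status_t names are hypothetical and are
not part of Open MPI or the tm API, and the constants are defined locally
(TM_SUCCESS as 0 and PBSE_SYSTEM as 15010, the values mentioned above) so that
it compiles without tm.h.

```c
/* Local stand-ins so the sketch builds without the PBS headers.
 * The values are the ones quoted in this post, not read from tm.h. */
#define TM_SUCCESS  0
#define PBSE_SYSTEM 15010

/* Hypothetical classification of one tm_poll() result. */
typedef enum {
    SPAWN_OK,           /* poll worked and the spawn event succeeded    */
    SPAWN_TM_ERROR,     /* tm protocol failure or incorrect call (rc)   */
    SPAWN_EVENT_FAILED  /* poll worked, but the spawn itself failed     */
} spawn_status_t;

/* rc is the return value of tm_poll(); local_err is the value it stored
 * through its last argument. rc is checked first because local_err is
 * only meaningful when the poll itself succeeded. */
spawn_status_t classify_poll(int rc, int local_err)
{
    if (TM_SUCCESS != rc) {
        return SPAWN_TM_ERROR;
    }
    if (0 != local_err) {
        return SPAWN_EVENT_FAILED;
    }
    return SPAWN_OK;
}
```

The case the existing code misses is classify_poll(TM_SUCCESS, PBSE_SYSTEM),
which comes out as SPAWN_EVENT_FAILED rather than SPAWN_OK.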
I checked torque 2.3.3 to confirm that its tm behaviour is the same as
OpenPBS in this respect. No idea about PBSPro.
David