On Mar 21, 2011, at 2:51 PM, Hugo Meyer wrote:

Thanks Ralph for your reply.

2011/3/21 Ralph Castain <rhc@open-mpi.org>
You should never access a pointer array's data area that way (i.e., by index against the raw data). You really should do:

if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, vpid))) {
      /* error report */
}
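
To illustrate why the accessor is preferred over indexing the raw data, here is a minimal sketch using a toy pointer array. The type toy_pointer_array_t and the function toy_pointer_array_get_item are hypothetical stand-ins invented for this example, not the real OPAL implementation; the sketch only assumes that a getter can return NULL for an out-of-range index or an empty slot, which raw indexing cannot do safely.

```c
#include <stddef.h>

/* Toy stand-in for a pointer array; NOT the real opal_pointer_array_t. */
typedef struct {
    void **addr;   /* slots; an unused slot holds NULL */
    int    size;   /* number of allocated slots */
} toy_pointer_array_t;

/* Bounds-checked getter: returns NULL when the array is missing,
 * the index is out of range, or the slot is empty. */
void *toy_pointer_array_get_item(const toy_pointer_array_t *arr, int index)
{
    if (NULL == arr || index < 0 || index >= arr->size) {
        return NULL;
    }
    return arr->addr[index];
}
```

With a getter like this, the caller has a single failure case to handle (a NULL return), whereas reading arr->addr[index] directly with a bad index reads past the end of the allocation.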

About this, I've changed my code accordingly, but I'm getting the same result: NULL when asking about a dead process.
The errmgr generally doesn't remove a process object upon failure - it just sets its state to some appropriate value. However, depending upon where you are trying to do this, and the history that got you down this code path, it is possible that the object was removed.

I'm writing this code in errmgr_orted.c, and it is executed when a process fails.

There's your problem - that module is run in the daemon, where the orte_job_data pointer array isn't used. You have to use the orte_local_jobdata and orte_local_children lists instead. So once the HNP replies with the jobid, you look up the orte_odls_job_t for that job from the orte_local_jobdata list.

If you want to find a particular proc, though, you would look under orte_local_children - search the list for a child whose jobid and vpid both match.

Note that you will not find that child process -unless- the child is under that daemon.
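
The search described above can be sketched roughly as follows. This is not ORTE code: toy_child_t and toy_find_child are hypothetical names, and the singly linked list is a simplified stand-in for the orte_local_children list. It only shows the matching rule (jobid and vpid must both match) and the NULL result when the process is not a child of this daemon.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified child record; the real ORTE child object
 * carries more state than this. */
typedef struct toy_child {
    uint32_t jobid;
    uint32_t vpid;
    struct toy_child *next;
} toy_child_t;

/* Walk the local-children list and return the entry whose jobid AND
 * vpid both match. Returns NULL when no such child is local. */
toy_child_t *toy_find_child(toy_child_t *head, uint32_t jobid, uint32_t vpid)
{
    toy_child_t *c;

    for (c = head; NULL != c; c = c->next) {
        if (c->jobid == jobid && c->vpid == vpid) {
            return c;
        }
    }
    return NULL;
}
```

A NULL return here does not mean the process never existed, only that it is not one of this daemon's own children.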

I'm not sure what you are trying to accomplish, so I can't give further advice. Note that daemons have limited knowledge of application processes that are not their own immediate children. What little they know regarding processes other than their own is stored in the nidmap/pidmap arrays - limited to location, local rank, and node rank. They have no storage currently allocated for things like the state of a non-local process.

Also, remember that if you are in a daemon, then the jdata objects are not populated. The daemons work exclusively from the orte_local_jobdata and orte_local_children lists, so you would have to find your process there.

That's why I'm asking the HNP for the jdata using ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume it has the information about the dead process.

Only after the daemon reports it.

Any idea?

Best regards.

Hugo Meyer
devel mailing list