The problem is here:

                                        /* Pack the faulty vpid */
                                        if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, &proc, 1, ORTE_NAME))) {
                                            goto CLEANUP;

The variable proc is apparently a pointer to orte_process_name_t. You therefore should have packed it like this:

                                        /* Pack the faulty vpid */
                                        if (ORTE_SUCCESS != (rc = opal_dss.pack(buffer, proc, 1, ORTE_NAME))) {
                                            goto CLEANUP;

i.e., without the & in front. With the &, you were packing the bytes of the pointer variable rather than the name it points to. As a result, the HNP was getting garbage for the process name, and thus finding NULL at the specified locations.
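To see why the extra & matters, here is a minimal stand-alone sketch. It does not use the real opal_dss code: demo_name_t, demo_pack, demo_packed_matches and demo_compare are hypothetical stand-ins, and I'm assuming for illustration that a name is just two 32-bit fields. The stand-in pack routine copies one name's worth of bytes from whatever address it is handed, which is the part that goes wrong when it is handed the address of the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a process name: assumed to be two 32-bit ids. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

/* Hypothetical stand-in for a pack call: copies one name's worth of bytes
 * from src, trusting that src really points at a name. */
static void demo_pack(unsigned char *buf, const void *src)
{
    assert(buf != NULL && src != NULL);
    memcpy(buf, src, sizeof(demo_name_t));
}

/* Returns 1 if the packed bytes decode to the expected name, 0 otherwise. */
static int demo_packed_matches(const unsigned char *buf,
                               const demo_name_t *expected)
{
    demo_name_t got;
    memcpy(&got, buf, sizeof(got));
    return got.jobid == expected->jobid && got.vpid == expected->vpid;
}

/* Packs the same name both ways. Returns 1 if packing `proc` round-trips
 * and packing `&proc` does not. */
static int demo_compare(void)
{
    demo_name_t name = { 0x12345678u, 0x9abcdef0u };
    /* The pad member keeps the deliberately wrong read below inside a
     * valid object even on platforms where a pointer is only 4 bytes. */
    struct { demo_name_t *proc; demo_name_t *pad; } v = { &name, NULL };
    unsigned char good[sizeof(demo_name_t)];
    unsigned char bad[sizeof(demo_name_t)];

    demo_pack(good, v.proc);   /* correct: the source is the name itself */
    demo_pack(bad, &v.proc);   /* wrong: the source is the pointer variable */

    return demo_packed_matches(good, &name) &&
           !demo_packed_matches(bad, &name);
}
```

Calling demo_compare() should return 1: the first buffer holds the jobid and vpid, while the second holds the bytes of an address, which is the kind of garbage the receiver then tries to interpret as a name.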

Just for testing, you might want to print out the received process name to ensure your communication is correct :-)
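Something along these lines is what I have in mind for the check. In the ORTE code itself I believe the usual way is to pass ORTE_NAME_PRINT(&name) to one of the opal_output calls, but so the example compiles on its own this sketch uses a hypothetical demo_name_t and a hypothetical demo_name_to_string helper, with an output format I made up for the purpose.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a process name: assumed to be two 32-bit ids. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

/* Formats a name as "[jobid,vpid]" into out.
 * Returns 0 on success, -1 if the buffer is too small. */
static int demo_name_to_string(char *out, size_t len, const demo_name_t *name)
{
    int n;

    assert(out != NULL && len > 0 && name != NULL);
    n = snprintf(out, len, "[%" PRIu32 ",%" PRIu32 "]",
                 name->jobid, name->vpid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Prints the name just after it has been unpacked, so the sender's and
 * the receiver's values can be compared by eye. */
static void demo_report_received(const demo_name_t *name)
{
    char text[64];

    if (0 == demo_name_to_string(text, sizeof(text), name)) {
        fprintf(stderr, "received faulty proc name %s\n", text);
    }
}
```

Printing the same name on the sending side just before the pack makes a mismatch obvious right away.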

On Mar 22, 2011, at 5:58 AM, Hugo Meyer wrote:

Thanks again Ralph for your reply.
There's your problem - that module is run in the daemon, where the orte_job_data pointer array isn't used. You have to use the orte_local_jobdata and orte_local_children lists instead. So once the HNP replies with the jobid, you look up the orte_odls_job_t for that job from the orte_local_jobdata list.
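The lookup amounts to walking the list and comparing jobids. Here is a rough stand-alone sketch of that idea only. It is not the actual ORTE code: the real list is an opal_list_t of orte_odls_job_t items, whereas demo_job_t, DEMO_JOBID_INVALID and demo_find_job below are hypothetical stand-ins built on a plain linked list so the example compiles by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no job"; callers should never search for it. */
#define DEMO_JOBID_INVALID UINT32_MAX

/* Hypothetical stand-in for a per-job record kept by a daemon. */
typedef struct demo_job {
    uint32_t jobid;
    int num_local_procs;
    struct demo_job *next;
} demo_job_t;

/* Walks the daemon-local job list and returns the record whose jobid
 * matches, or NULL if this daemon has no record of that job. */
static demo_job_t *demo_find_job(demo_job_t *head, uint32_t jobid)
{
    demo_job_t *job;

    assert(jobid != DEMO_JOBID_INVALID);
    for (job = head; NULL != job; job = job->next) {
        if (job->jobid == jobid) {
            return job;
        }
    }
    return NULL;
}
```

A NULL return here only means this particular daemon holds no record for that job, so the caller has to check for it before touching the result.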

I'm now sending you all of the code involved. At the beginning I'm doing something along the lines of what you describe. Then, having the child info, I ask the HNP for the jobdata of the child, but I'm still getting no data about the child (that is, the dead process). I need this info so that I can ask another orted to restart the failed process.

I'm not sure what you are trying to accomplish, so I can't give further advice. Note that daemons have limited knowledge of application processes that are not their own immediate children. What little they know regarding processes other than their own is stored in the nidmap/pidmap arrays - limited to location, local rank, and node rank. They have no storage currently allocated for things like the state of a non-local process.

I want to restart the process on another node, which is why I need the jobdata. So, can't the HNP do something like:
jdata = orte_get_job_data_object(proc.jobid);

when the proc doesn't belong to it?
So where can I obtain this information? I was assuming that I cannot ask the dead process's daemon about it (because I assumed that the daemon is dead too, but that's not true). I was supposing that in the HNP I could execute the statement above.

I'm attaching all the code involved in the situation I described. I have made some changes since my first email, but what I'm trying to do is basically the same. At line 23 of the orted_comm.c that I'm sending, I'm always getting NULL as a result, so I can't obtain the jdata.

Thanks a lot again for your help.

Best Regards. 

Hugo Meyer