Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Update orte_proc structure
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-01 12:34:11


I'm not entirely sure what you are doing here. The orte_job_t,
orte_node_t, and orte_proc_t objects are only used on mpirun - the
arrays built from those objects are only defined on mpirun itself, not
on any orted.

When you say "orte daemon which acts as HNP", are you implying that
you have some orted out there that is trying to behave like an HNP? Or
do you really mean mpirun itself?

I suspect the reason you are seeing a difference is that orte-ps only
gets its info from mpirun, and you are somehow storing the modified
data on an orted instead.
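
To illustrate the point above, here is a minimal, self-contained sketch.
It is a toy model only: every toy_* name is hypothetical and none of it
is the real ORTE API. It assumes each process (mpirun, an orted) holds
its own private table, and that a query tool such as orte-ps reads only
the table held by mpirun.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID UINT32_MAX  /* hypothetical stand-in for an INVALID jobid/vpid */
#define TOY_NPROCS  4           /* toy job size, chosen only for illustration */

/* Toy stand-ins for a process name and a per-process record.
 * These are NOT the real orte_process_name_t / orte_proc_t. */
typedef struct { uint32_t jobid; uint32_t vpid; } toy_name_t;
typedef struct { toy_name_t name; toy_name_t protector; } toy_proc_t;

/* Each process (mpirun or an orted) would own one private table. */
typedef struct { toy_proc_t procs[TOY_NPROCS]; } toy_job_table_t;

/* Initialize a table: every protector starts out INVALID. */
static void toy_table_init(toy_job_table_t *t, uint32_t jobid)
{
    for (uint32_t v = 0; v < TOY_NPROCS; v++) {
        t->procs[v].name.jobid      = jobid;
        t->procs[v].name.vpid       = v;
        t->procs[v].protector.jobid = TOY_INVALID;
        t->procs[v].protector.vpid  = TOY_INVALID;
    }
}

/* Record a protector in ONE table only; returns -1 on a bad vpid. */
static int toy_set_protector(toy_job_table_t *t, uint32_t vpid,
                             toy_name_t protector)
{
    if (vpid >= TOY_NPROCS) {
        return -1;
    }
    t->procs[vpid].protector = protector;
    return 0;
}

/* What the query tool sees: it is only ever handed mpirun's table. */
static toy_name_t toy_ps_query(const toy_job_table_t *mpirun_table,
                               uint32_t vpid)
{
    assert(vpid < TOY_NPROCS);
    return mpirun_table->procs[vpid].protector;
}
```

If toy_set_protector() is called on a daemon's table, toy_ps_query() on
mpirun's table still returns the INVALID protector; the value only shows
up once the update is applied to mpirun's own table.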

Did you modify orted itself to create and store an orte_job_t array?
This would not be a good idea as a significant amount of code in the
system expects that array to only exist inside of mpirun. You could
run into some really strange behavior in various scenarios.

Ralph

On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote:

> Hi All,
>
> I have a question about how to update the orte_proc structure.
>
> I have modified the orte_proc structure to include another field
> (of type orte_process_name_t) that describes the node which stores my
> checkpoints and logs:
>
> struct orte_proc_t {
>     ...
> #if OPAL_ENABLE_FT_RADIC == 1
>     /* protector node */
>     orte_process_name_t protector;
> #endif
> };
>
> I then added some code to orted_comm.c which I think should
> update the job structure:
>
> /* Update the structure */
> if (NULL == (jdata = orte_get_job_data_object(sender_jobid))) {
>     ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>     goto CLEANUP;
> }
> procs = (orte_proc_t**)jdata->procs->addr;
> if (NULL == procs[sender_vpid]) {
>     ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>     goto CLEANUP;
> }
> procs[sender_vpid]->protector.jobid = protector_jobid;
> procs[sender_vpid]->protector.vpid = protector_vpid;
> opal_output(0, "%s is the protector of %s",
>             ORTE_NAME_PRINT(&procs[sender_vpid]->name),
>             ORTE_NAME_PRINT(&procs[sender_vpid]->protector));
>
> In the log of the orte daemon which acts as HNP I can see the correct
> information that was added to the orte_proc structure, but when I
> use my modified version of orte-ps I get incorrect information
> ([[INVALID],INVALID]). Below is the code I used in orte-ps:
>
> #if OPAL_ENABLE_FT_RADIC == 1
> protector = orte_util_print_name_args(&vpid->protector);
> printf("%*s |", len_protector, protector);
> #endif
>
> The question is: why does the HNP show the correct information while
> orte-ps doesn't?
>
> Thanks
> --
>
> Leonardo Fialho
> Computer Architecture and Operating Systems Department - CAOS
> Universidad Autonoma de Barcelona - UAB
> ETSE, Edifcio Q, QC/3088
> http://www.caos.uab.es
> Phone: +34-93-581-2888
> Fax: +34-93-581-2478