Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Update orte_proc structure
From: Leonardo Fialho (lfialho_at_[hidden])
Date: 2008-10-01 12:40:36


Hi Ralph,

My mistake. When I said "orted which acts as HNP", I really meant
mpirun. I just want to store one new piece of information in orte_proc
and share it with new daemons and tools that connect to the RTE.

Thanks,
Leonardo

Ralph Castain wrote:
> I'm not entirely sure what you are doing here. The orte_job_t,
> orte_node_t, and orte_proc_t objects are only used on mpirun - the
> arrays built from those objects are only defined on mpirun itself, not
> on any orted.
>
> When you say "orte daemon which acts as HNP", are you implying that
> you have some orted out there that is trying to behave like an HNP? Or
> do you really mean mpirun itself?
>
> I suspect the reason you are seeing a difference is that orte-ps only
> gets its info from mpirun, and you are somehow storing the modified
> data on an orted instead.
>
> Did you modify orted itself to create and store an orte_job_t array?
> This would not be a good idea as a significant amount of code in the
> system expects that array to only exist inside of mpirun. You could
> run into some really strange behavior in various scenarios.
>
> Ralph
>
>
> On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote:
>
>> Hi All,
>>
>> I have a question about how to update the orte_proc structure.
>>
>> I have modified the orte_proc structure to include another field
>> (of type orte_process_name_t) that describes the node which stores
>> my checkpoints and logs:
>>
>> struct orte_proc_t {
>>     ...
>> #if OPAL_ENABLE_FT_RADIC == 1
>>     /* protector node */
>>     orte_process_name_t protector;
>> #endif
>> };
>>
>> I then added code to orted_comm.c which I think should update the
>> job structure:
>> /* Update the structure */
>> if (NULL == (jdata = orte_get_job_data_object(sender_jobid))) {
>>     ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>>     goto CLEANUP;
>> }
>> procs = (orte_proc_t**)jdata->procs->addr;
>> if (NULL == procs[sender_vpid]) {
>>     ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
>>     goto CLEANUP;
>> }
>> procs[sender_vpid]->protector.jobid = protector_jobid;
>> procs[sender_vpid]->protector.vpid = protector_vpid;
>> opal_output(0, "%s is the protector of %s",
>>             ORTE_NAME_PRINT(&procs[sender_vpid]->name),
>>             ORTE_NAME_PRINT(&procs[sender_vpid]->protector));
>>
>> In the log of the orte daemon which acts as HNP I can see that the
>> correct information was added to the orte_proc structure, but when I
>> use my modified version of orte-ps I get incorrect information
>> ([[INVALID],INVALID]). Below is the code I used in orte-ps:
>>
>> #if OPAL_ENABLE_FT_RADIC == 1
>> protector = orte_util_print_name_args(&vpid->protector);
>> printf("%*s |", len_protector, protector);
>> #endif
>>
>> The question is: why does the HNP show the correct information
>> while orte-ps doesn't?
>>
>> Thanks
>> --
>>
>> Leonardo Fialho
>> Computer Architecture and Operating Systems Department - CAOS
>> Universidad Autonoma de Barcelona - UAB
>> ETSE, Edifcio Q, QC/3088
>> http://www.caos.uab.es
>> Phone: +34-93-581-2888
>> Fax: +34-93-581-2478
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
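
One way the symptom described in this thread could arise, assuming the
tool receives a serialized copy of each process record instead of
reading the job data in place: a field added to the struct only reaches
the tool if the pack and unpack routines are extended too. Otherwise the
receiving side keeps the constructor default and prints it as invalid.
The sketch below is a toy model of that effect only. Every toy_* name is
hypothetical and is not part of ORTE, and the messages above do not
establish that this is the actual cause here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the real types; not ORTE's definitions. */
#define TOY_INVALID UINT32_MAX

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} toy_name_t;

typedef struct {
    toy_name_t name;
    toy_name_t protector;   /* the newly added field */
} toy_proc_t;

/* Constructor: every field starts out invalid. */
void toy_proc_init(toy_proc_t *p)
{
    assert(p != NULL);
    p->name.jobid = p->name.vpid = TOY_INVALID;
    p->protector.jobid = p->protector.vpid = TOY_INVALID;
}

/* Pack routine written before the new field existed: only name is sent. */
size_t toy_pack_old(const toy_proc_t *p, uint8_t *buf)
{
    memcpy(buf, &p->name, sizeof p->name);
    return sizeof p->name;
}

/* Pack routine extended to send the new field as well. */
size_t toy_pack_new(const toy_proc_t *p, uint8_t *buf)
{
    size_t n = toy_pack_old(p, buf);
    memcpy(buf + n, &p->protector, sizeof p->protector);
    return n + sizeof p->protector;
}

/* Unpack: fields missing from the buffer keep their constructor default. */
void toy_unpack(toy_proc_t *p, const uint8_t *buf, size_t len)
{
    toy_proc_init(p);
    if (len >= sizeof p->name) {
        memcpy(&p->name, buf, sizeof p->name);
    }
    if (len >= sizeof p->name + sizeof p->protector) {
        memcpy(&p->protector, buf + sizeof p->name, sizeof p->protector);
    }
}

/* Format a name the way a ps-like tool might display it. */
void toy_print_name(const toy_name_t *n, char *out, size_t outlen)
{
    assert(n != NULL && out != NULL);
    if (n->jobid == TOY_INVALID || n->vpid == TOY_INVALID) {
        snprintf(out, outlen, "[INVALID,INVALID]");
    } else {
        snprintf(out, outlen, "[%u,%u]",
                 (unsigned)n->jobid, (unsigned)n->vpid);
    }
}
```

In this model, a record whose protector is set on the sending side still
prints as invalid after toy_pack_old followed by toy_unpack, and prints
correctly once toy_pack_new is used. The analogous check in a real tree
would be to confirm that whatever routine serializes the process record
for tools also handles the new field.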