Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Uninitialized ORTE epoch values
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-08-05 18:31:27


Thanks for the explanation. It kinda raises a question, though - I've noticed that the assignment of epoch seems to circle around in a number of places. We call the ess_base function to get_epoch, and then we assign an epoch. But the base function doesn't actually seem to do much, if anything.

It's somewhat confusing and difficult to trace. I know Wes and I already planned to clean up some of this once we get back to the orte state machine work, but I'm hoping we can simplify this code somewhat to make it easier to understand and follow.

Meantime, we'll continue to chase down the problems.

On Aug 5, 2011, at 4:17 PM, Thomas Herault wrote:

>
> The warnings issued through ess_base_select.c:46 are annoying but harmless. Wesley is going to hunt them down and remove them, but they are really caused by the print:
> orte_ess_base_proc_get_epoch (ess_base_select.c:46) calls ORTE_NAME_PRINT(proc), which prints proc->epoch before proc->epoch has been assigned the locally computed epoch value. That assignment is done one level above orte_ess_base_proc_get_epoch: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) does proc->epoch = orte_ess_base_proc_get_epoch(proc);
>
> Wesley is going to find where this proc was created to ensure that its epoch field is initialized to INVALID_EPOCH, but what this trace says is really that nothing references it before it is initialized to its correct value.
>
> Thomas
>
> On Aug 5, 2011, at 4:52 PM, Ralph Castain wrote:
>
>> Thanks Wes - it isn't the print that's the issue, it's the fact that we have epochs that aren't being initialized, and whatever other problems that may be causing.
>>
>>
>> On Aug 5, 2011, at 2:45 PM, Wesley Bland wrote:
>>
>>> I don't think these are anything to worry about since they're all print statements, but I will work on these tonight.
>>>
>>> On Fri, Aug 5, 2011 at 3:03 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>> Ralph and I are trying to track down the mysterious ORTE error.
>>>
>>> In doing so, I have found at least one fairly repeatable error on my cluster: when running the ibm/dynamic/spawn test through SLURM, where we mpirun 3 procs and then MPI_COMM_SPAWN 3 more. Running the orteds through valgrind, I see a bunch of uninitialized epoch issues.
>>>
>>> Attached are the 2 valgrind outputs.
>>>
>>> Can these be fixed? I don't know if they're actual problems or not, but seeing uninitialized values go by makes me extremely nervous.
>>>
>>> Thanks!
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
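For readers following the thread, here is a minimal sketch of the ordering Thomas describes: a lookup function prints the name (and therefore reads the epoch field) before the caller stores the returned epoch into that same field. It also shows the fix he mentions, setting the field to an invalid sentinel when the name is created. All names here (proc_name_t, EPOCH_INVALID, proc_name_init, proc_name_print, proc_get_epoch) are hypothetical stand-ins, not the real ORTE definitions, and the epoch computation is a placeholder.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins; these are NOT the real ORTE types or macros. */
typedef uint32_t epoch_t;
#define EPOCH_INVALID ((epoch_t)UINT32_MAX)

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
    epoch_t  epoch;
} proc_name_t;

/* Create a name with the epoch set to a sentinel, so that reading the
 * field before the real value is known is well defined. */
static void proc_name_init(proc_name_t *name, uint32_t jobid, uint32_t vpid)
{
    assert(name != NULL);
    name->jobid = jobid;
    name->vpid  = vpid;
    name->epoch = EPOCH_INVALID;
}

/* Format a name for diagnostics. This reads every field, epoch included. */
static int proc_name_print(char *buf, size_t len, const proc_name_t *name)
{
    if (name->epoch == EPOCH_INVALID) {
        return snprintf(buf, len, "[%u,%u,INVALID]",
                        (unsigned)name->jobid, (unsigned)name->vpid);
    }
    return snprintf(buf, len, "[%u,%u,%u]",
                    (unsigned)name->jobid, (unsigned)name->vpid,
                    (unsigned)name->epoch);
}

/* Toy lookup: traces the name first (as a debug print would), then
 * returns the epoch that the caller is expected to store. */
static epoch_t proc_get_epoch(const proc_name_t *name, char *trace, size_t len)
{
    proc_name_print(trace, len, name);  /* reads name->epoch here */
    return (epoch_t)(name->vpid + 1);   /* placeholder computation */
}
```

A caller would do proc_name_init(&p, jobid, vpid) and then p.epoch = proc_get_epoch(&p, trace, sizeof trace). Without the init step, the print inside the lookup reads an indeterminate value, which is the kind of access valgrind reports; with it, the trace shows the sentinel instead.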