Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Error after ompi-restart
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-11-10 10:20:25


Thanks for the patch. I tested and applied it in r19961 of the Open
MPI Trunk. Sorry it took me so long to do so.

Thanks again,
Josh

On Nov 4, 2008, at 11:10 AM, Leonardo Fialho wrote:

> Josh,
>
> It works fine for me. I think that was the error.
>
> Leonardo
>
> Josh Hursey wrote:
>> Leonardo,
>>
>> Sorry I have been really slow in replying; I have been pretty
>> swamped lately.
>>
>> What version of the trunk are you using? I've been seeing C/R
>> failures starting around r19872, but I haven't had time to focus on
>> trying to find out what is going wrong.
>>
>> You may be right in your assessment below. I'll try to look into it
>> this week. If you find that making this change fixes your problem,
>> let me know and I'll apply the patch.
>>
>> Thanks,
>> Josh
>>
>> On Nov 4, 2008, at 10:16 AM, Leonardo Fialho wrote:
>>
>>> I'm not sure, but I think that line 659 of
>>> orte/mca/ess/env/ess_env_module.c should contain
>>>
>>> if (ORTE_SUCCESS != (ret =
>>> orte_ess_base_build_nidmap(orte_process_info.sync_buf, &nidmap,
>>> *jmap*))) {
>>>
>>> But actually it contains
>>>
>>> if (ORTE_SUCCESS != (ret =
>>> orte_ess_base_build_nidmap(orte_process_info.sync_buf, &nidmap,
>>> *&jmap->pmap*))) {
>>>
>>> No?
>>>
>>> Leonardo
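
To make the distinction in the message above concrete, here is a minimal, self-contained sketch of why a pointer to a structure and a pointer to one of its members are not interchangeable. The types `proc_array_t` and `job_map_t` and the function `count_procs` are hypothetical stand-ins invented for illustration; they are not Open MPI's definitions, and the real signature of `orte_ess_base_build_nidmap` is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an array of per-process entries. */
typedef struct {
    size_t size;            /* number of entries currently stored */
} proc_array_t;

/* Hypothetical stand-in for a job map that embeds a process array. */
typedef struct {
    int job_id;             /* placed first so that pmap has a non-zero offset */
    proc_array_t pmap;      /* embedded member, reached as jmap->pmap */
} job_map_t;

/* Takes the ENCLOSING structure and reaches the member itself.
 * A caller must therefore pass jmap, not &jmap->pmap. */
static size_t count_procs(const job_map_t *jmap)
{
    if (jmap == NULL) {
        return 0;
    }
    return jmap->pmap.size;
}

/* Byte distance between the start of the job map and its pmap member.
 * Whenever this is non-zero, the two pointers refer to different addresses. */
static size_t pmap_offset(void)
{
    assert(offsetof(job_map_t, pmap) > 0);
    return offsetof(job_map_t, pmap);
}
```

With properly typed parameters the compiler diagnoses a mismatch between the two pointer types. If the argument travels through a `void *` or a cast, the mismatch goes unnoticed and the callee reads the wrong memory, which is one plausible way to end up with a crash like the one reported earlier in the thread.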
>>>
>>>
>>> Leonardo Fialho wrote:
>>>> Hi All,
>>>>
>>>> I think there is an error in the trunk version when trying to
>>>> restore a checkpoint.
>>>>
>>>> The function orte_util_decode_pidmap, while attempting to execute
>>>> the following code,
>>>>
>>>> /* store the data */
>>>> for (i=0; i < num_procs; i++) {
>>>>     pmap.node = nodes[i];
>>>>     pmap.local_rank = local_rank[i];
>>>>     pmap.node_rank = node_rank[i];
>>>>     opal_value_array_set_item(procs, i, &pmap);
>>>> }
>>>>
>>>> produces a segmentation fault
>>>>
>>>> [nodo2:18027] *** Process received signal ***
>>>> [nodo2:18027] Signal: Segmentation fault (11)
>>>> [nodo2:18027] Signal code: Address not mapped (1)
>>>> [nodo2:18027] Failing at address: (nil)
>>>>
>>>> I traced the problem and I think that it occurs at
>>>> the line opal_value_array_set_item(procs, i, &pmap);
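
A fault at address (nil) inside a "set item" call is consistent with the array argument being NULL or uninitialized. The sketch below shows one defensive way to write such a routine so that a bad array pointer produces an error code instead of a segmentation fault. Everything here (`value_array_t`, `pmap_t`, `value_array_set_item`) is a hypothetical stand-in written for this illustration; it is not Open MPI's `opal_value_array_t` implementation, and the field names are only loosely modeled on the loop quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable array of fixed-size items (NOT opal_value_array_t). */
typedef struct {
    unsigned char *items;   /* backing storage, NULL until the first growth */
    size_t item_size;       /* size of one element in bytes */
    size_t size;            /* number of valid elements */
    size_t capacity;        /* number of allocated slots */
} value_array_t;

/* Hypothetical per-process record with the three fields used in the loop. */
typedef struct {
    int node;
    int local_rank;
    int node_rank;
} pmap_t;

static void value_array_init(value_array_t *a, size_t item_size)
{
    assert(a != NULL && item_size > 0);
    a->items = NULL;
    a->item_size = item_size;
    a->size = 0;
    a->capacity = 0;
}

/* Copy *item into slot idx, growing the storage if needed.
 * Returns 0 on success, -1 on invalid arguments or allocation failure. */
static int value_array_set_item(value_array_t *a, size_t idx, const void *item)
{
    if (a == NULL || item == NULL || a->item_size == 0) {
        return -1;          /* refuse a NULL or uninitialized array */
    }
    if (idx >= a->capacity) {
        size_t new_cap = a->capacity ? a->capacity : 4;
        while (new_cap <= idx) {
            new_cap *= 2;
        }
        unsigned char *p = realloc(a->items, new_cap * a->item_size);
        if (p == NULL) {
            return -1;
        }
        /* Zero the newly added slots so skipped indices hold defined data. */
        memset(p + a->capacity * a->item_size, 0,
               (new_cap - a->capacity) * a->item_size);
        a->items = p;
        a->capacity = new_cap;
    }
    memcpy(a->items + idx * a->item_size, item, a->item_size);
    if (idx >= a->size) {
        a->size = idx + 1;
    }
    return 0;
}

static void value_array_free(value_array_t *a)
{
    if (a != NULL) {
        free(a->items);
        a->items = NULL;
        a->size = 0;
        a->capacity = 0;
    }
}
```

A caller in the style of the quoted loop would check the return value of each `value_array_set_item` call and stop on the first failure, which turns the crash into a reportable error.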
>>>>
>>>> Thanks,
>>>>
>>>
>>>
>>> --
>>> Leonardo Fialho
>>> Computer Architecture and Operating Systems Department - CAOS
>>> Universidad Autonoma de Barcelona - UAB
>>> ETSE, Edifcio Q, QC/3088
>>> http://www.caos.uab.es
>>> Phone: +34-93-581-2888
>>> Fax: +34-93-581-2478
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel