Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question about restart
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-04-27 13:07:59


Thanks for the patch. I applied a version of it to the Open MPI trunk
(r21079) and started the process of moving it to the v1.3 release
series:
   https://svn.open-mpi.org/trac/ompi/ticket/1898

Thanks,
Josh

On Apr 27, 2009, at 9:53 AM, Josh Hursey wrote:

> Thanks for the bug report.
>
> I am having a difficult time reproducing the error. Are you running
> on a single machine using shared memory or across multiple machines
> using a high-speed network?
>
> Based on your bug report, my suspicion is that an event is not being
> properly de-registered from the event engine. Typically this means,
> for C/R, that the finalization routine in one of the BTLs
> (interconnect drivers) is missing something. The patch you propose
> seems fine, but I agree that it may be masking another problem.
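The de-registration concern above can be pictured with a toy event engine. This is only a sketch under stated assumptions: `event_engine`, `engine_register`, `engine_deregister`, `engine_progress`, and the `driver_*` functions are hypothetical names made up for illustration, not Open MPI or libevent APIs. It shows the general rule that whatever a driver registers at init time must be removed again in its finalization routine, so the engine never calls back into state that has been torn down.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8

typedef void (*event_cb)(void *arg);

/* One registered callback. All names here are illustrative only. */
struct event_slot {
    event_cb cb;
    void    *arg;
    int      active;
};

struct event_engine {
    struct event_slot slots[MAX_EVENTS];
};

/* Register a callback; returns a slot id, or -1 if the table is full. */
int engine_register(struct event_engine *eng, event_cb cb, void *arg)
{
    for (int i = 0; i < MAX_EVENTS; i++) {
        if (!eng->slots[i].active) {
            eng->slots[i].cb = cb;
            eng->slots[i].arg = arg;
            eng->slots[i].active = 1;
            return i;
        }
    }
    return -1;
}

/* Remove a callback so the engine never touches its argument again. */
void engine_deregister(struct event_engine *eng, int id)
{
    if (id >= 0 && id < MAX_EVENTS) {
        eng->slots[id].active = 0;
        eng->slots[id].cb = NULL;
        eng->slots[id].arg = NULL;
    }
}

/* Fire every active callback once; returns how many were fired. */
int engine_progress(struct event_engine *eng)
{
    int fired = 0;
    for (int i = 0; i < MAX_EVENTS; i++) {
        if (eng->slots[i].active) {
            eng->slots[i].cb(eng->slots[i].arg);
            fired++;
        }
    }
    return fired;
}

/* Hypothetical driver that polls through the engine. */
struct driver {
    int event_id;
    int polls;
};

static void driver_poll(void *arg)
{
    ((struct driver *)arg)->polls++;
}

int driver_init(struct event_engine *eng, struct driver *drv)
{
    drv->polls = 0;
    drv->event_id = engine_register(eng, driver_poll, drv);
    return drv->event_id >= 0 ? 0 : -1;
}

/* Finalization must undo the registration made in driver_init(). */
void driver_finalize(struct event_engine *eng, struct driver *drv)
{
    engine_deregister(eng, drv->event_id);
    drv->event_id = -1;
}
```

If `driver_finalize()` skipped the `engine_deregister()` call, a later `engine_progress()` would still invoke `driver_poll()` with a pointer to a driver that may already have been released.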
>
> I'll keep digging and let you know if I find something. In the
> meantime I will attempt to push in a patch to protect the free() that
> you cited.
>
> Cheers,
> Josh
>
> On Apr 22, 2009, at 4:09 PM, Yaakoub El Khamra wrote:
>
>> Incidentally, if I add a check that base->sig.sh_old is not NULL
>> and recompile, everything works fine. I am concerned this is just
>> fixing a symptom rather than the root of the problem.
>>
>> if(base->sig.sh_old != NULL)
>> free(base->sig.sh_old);
>>
>> is what I used.
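A slightly fuller version of that guard might look like the sketch below. It is a minimal illustration, not the code in Open MPI's signal.c: `struct sig_state`, `sig_state_alloc`, and `sig_state_dealloc` are hypothetical stand-ins, and only the `sh_old` field name is borrowed from the snippet above. In standard C, `free(NULL)` is a no-op, so the check mainly matters when `free` is wrapped by a debugging allocator that reports the call, as the "malloc debug: Invalid free" message suggests is happening here. Resetting the pointer afterwards also makes a second dealloc harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the signal bookkeeping discussed above. */
struct sig_state {
    void **sh_old;      /* saved handlers; NULL if never allocated */
    int    sh_old_max;  /* number of entries in sh_old */
};

/* Allocate the saved-handler table; returns 0 on success, -1 on failure. */
int sig_state_alloc(struct sig_state *st, int max)
{
    assert(st != NULL);
    st->sh_old = calloc((size_t)max, sizeof(*st->sh_old));
    if (st->sh_old == NULL) {
        st->sh_old_max = 0;
        return -1;
    }
    st->sh_old_max = max;
    return 0;
}

/* Release the table. Safe to call on a never-allocated or already
 * released state, because the pointer is checked and then reset. */
void sig_state_dealloc(struct sig_state *st)
{
    assert(st != NULL);
    if (st->sh_old != NULL) {
        free(st->sh_old);
    }
    st->sh_old = NULL;
    st->sh_old_max = 0;
}
```

This keeps the dealloc path quiet and idempotent, but it does not explain why the pointer was NULL at that point, so the concern about treating a symptom still stands.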
>>
>> Regards
>> Yaakoub El Khamra
>>
>>
>>
>>
>> On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra
>> <yye00_at_[hidden]> wrote:
>>> Greetings
>>> I am trying to get the checkpoint/restart to work on a single
>>> machine
>>> with openmpi 1.3 (also tried an svn check-out) and ran into a few
>>> problems. I am guessing I am doing something wrong, and would
>>> appreciate some help.
>>>
>>> I built openmpi with:
>>> ./configure --prefi=/usr/local/openmpi-1.3/ --enable-picky
>>> --enable-debug --enable-mpi-f77 --enable-mpi-f90
>>> --enable-mpi-profile --enable-mpi-cxx
>>> --enable-pretty-print-stacktrace --enable-binaries
>>> --enable-trace --enable-static=yes --enable-debug
>>> --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
>>> --enable-ft-thread --with-blcr=/usr/local/blcr/
>>> --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
>>>
>>> I am using blcr 0.8.1 configured with:
>>> ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
>>> --enable-libcr-tracing=yes --enable-kernel-tracing=yes
>>> --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
>>>
>>> Checkpointing works fine, without any problems. I run with:
>>> mpirun -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1 -am
>>> ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1
>>> matmultf.exe
>>>
>>> I am able to checkpoint without any problems using ompi-checkpoint
>>> --status --term <pid>
>>> but when I try to restart, I get the following error:
>>>
>>> [yye00_at_localhost FTOpenMPI]$ ompi-restart -v ompi_global_snapshot_23858.ckpt
>>> [localhost.localdomain:24394] Checking for the existence of (/home/yye00/ompi_global_snapshot_23858.ckpt)
>>> [localhost.localdomain:24394] Restarting from file (ompi_global_snapshot_23858.ckpt)
>>> [localhost.localdomain:24394] Exec in self
>>> malloc debug: Invalid free (signal.c, 304)
>>> malloc debug: Invalid free (signal.c, 304)
>>> [localhost:23860] *** Process received signal ***
>>> [localhost:23860] Signal: Bus error (7)
>>> [localhost:23860] Signal code: (2)
>>> [localhost:23860] Failing at address: 0x7fcbb737ef88
>>> [localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
>>> [localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1eccae]
>>> [localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed5ba]
>>> [localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed745]
>>> [localhost:23860] [ 4] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc) [0x7fcbbcba2aa0]
>>> [localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbdead1]
>>> [localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbde8e2]
>>> [localhost:23860] [ 7] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c) [0x7fcbbcbde17c]
>>> [localhost:23860] [ 8] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2) [0x7fcbbcba45e9]
>>> [localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0 [0x7fcbbced1d9d]
>>> [localhost:23860] [10] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x7fcbbcba4509]
>>> [localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcba4bc2]
>>> [localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
>>> [localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
>>> [localhost:23860] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 24396 on node
>>> localhost.localdomain exited on signal 7 (Bus error).
>>> --------------------------------------------------------------------------
>>>
>>> Running strace on ompi-restart did not provide any useful
>>> information. Any suggestions are greatly appreciated. Incidentally,
>>> line 304 of signal.c is in opal's evsignal_dealloc subroutine; the
>>> actual line is "free(base->sig.sh_old);". I am about to add debug
>>> statements to that subroutine to see if I can get further
>>> information, but was hoping the problem is more user-related than
>>> code-related.
>>>
>>>
>>> Regards
>>> Yaakoub El Khamra
>>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users