Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question about restart
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2009-04-27 09:53:22


Thanks for the bug report.

I am having a difficult time reproducing the error. Are you running on
a single machine using shared memory or across multiple machines using
a high-speed network?

Based on your bug report, my suspicion is that an event is not being
properly de-registered from the event engine. Typically this means,
for C/R, that the finalization routine in one of the BTLs
(interconnect drivers) is missing something. The patch you propose
seems fine, but I agree that it may be masking another problem.

I'll keep digging and let you know if I find something. In the
meantime I will attempt to push in a patch to protect the free() that
you cited.

Cheers,
Josh

On Apr 22, 2009, at 4:09 PM, Yaakoub El Khamra wrote:

> Incidentally, if I add a check that base->sig.sh_old is not NULL and
> recompile, everything works fine. I am concerned this is just fixing
> a symptom rather than the root of the problem.
>
> if (base->sig.sh_old != NULL)
>     free(base->sig.sh_old);
>
> is what I used.
>
> Regards
> Yaakoub El Khamra
>
>
>
>
> On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra
> <yye00_at_[hidden]> wrote:
>> Greetings,
>> I am trying to get checkpoint/restart to work on a single machine
>> with Open MPI 1.3 (I also tried an SVN checkout) and ran into a few
>> problems. I am guessing I am doing something wrong, and would
>> appreciate some help.
>>
>> I built openmpi with:
>> ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
>> --enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
>> --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
>> --enable-trace --enable-static=yes --enable-debug
>> --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
>> --enable-ft-thread --with-blcr=/usr/local/blcr/
>> --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
>>
>> I am using blcr 0.8.1 configured with:
>> ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
>> --enable-libcr-tracing=yes --enable-kernel-tracing=yes
>> --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
>>
>> Checkpointing works without any problems. I run with:
>> mpirun -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1 -am
>> ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1 matmultf.exe
>>
>> I can take a checkpoint using
>> ompi-checkpoint --status --term <pid>
>> but when I try to restart, I get the following error:
>>
>> [yye00_at_localhost FTOpenMPI]$ ompi-restart -v ompi_global_snapshot_23858.ckpt
>> [localhost.localdomain:24394] Checking for the existence of (/home/yye00/ompi_global_snapshot_23858.ckpt)
>> [localhost.localdomain:24394] Restarting from file (ompi_global_snapshot_23858.ckpt)
>> [localhost.localdomain:24394] Exec in self
>> malloc debug: Invalid free (signal.c, 304)
>> malloc debug: Invalid free (signal.c, 304)
>> [localhost:23860] *** Process received signal ***
>> [localhost:23860] Signal: Bus error (7)
>> [localhost:23860] Signal code: (2)
>> [localhost:23860] Failing at address: 0x7fcbb737ef88
>> [localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
>> [localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1eccae]
>> [localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed5ba]
>> [localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0 [0x7fcbbd1ed745]
>> [localhost:23860] [ 4] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc) [0x7fcbbcba2aa0]
>> [localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbdead1]
>> [localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcbde8e2]
>> [localhost:23860] [ 7] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c) [0x7fcbbcbde17c]
>> [localhost:23860] [ 8] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2) [0x7fcbbcba45e9]
>> [localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0 [0x7fcbbced1d9d]
>> [localhost:23860] [10] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x7fcbbcba4509]
>> [localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0 [0x7fcbbcba4bc2]
>> [localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
>> [localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
>> [localhost:23860] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 24396 on node
>> localhost.localdomain exited on signal 7 (Bus error).
>> --------------------------------------------------------------------------
>>
>> Running strace on ompi-restart did not provide any useful
>> information. Any suggestions are greatly appreciated. Incidentally,
>> line 304 of signal.c is in a deallocation subroutine in opal,
>> evsignal_dealloc; the actual line is "free(base->sig.sh_old);".
>> I am about to add debug statements to that subroutine to see if I
>> can get further information, but was hoping the problem is more
>> user-related than code-related.
>>
>>
>> Regards
>> Yaakoub El Khamra
>>