
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Segfault in ompi-restart (ft-enable-cr)
From: Joshua Hursey (jjhursey_at_[hidden])
Date: 2010-03-03 15:31:49


On Mar 2, 2010, at 9:17 AM, Fernando Lemos wrote:

> On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernandotcl_at_[hidden]> wrote:
>> Hello,
>>
>>
>> I'm trying to come up with a fault tolerant OpenMPI setup for research
>> purposes. I'm doing some tests now, but I'm stuck with a segfault when
>> I try to restart my test program from a checkpoint.
>>
>> My test program is the "ring" program, where messages are sent to the
>> next node in the ring N times. It's pretty simple, I can supply the
>> source code if needed. I'm running it like this:
>>
>> # mpirun -np 4 -am ft-enable-cr ring
>> ...
>>>>> Process 1 sending 703 to 2
>>>>> Process 3 received 704
>>>>> Process 3 sending 704 to 0
>>>>> Process 3 received 703
>>>>> Process 3 sending 703 to 0
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 18358 on node debian1
>> exited on signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> 4 total processes killed (some possibly by mpirun during cleanup)
>>
>> That's the output when I ompi-checkpoint the mpirun PID from another terminal.
>>
>> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
>> checkpoint directory has been created in $HOME.
>>
>> This is what I get when I try to run ompi-restart:
>>
>> root_at_debian1:~# ps ax | grep mpirun
>> 18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
>> 18378 pts/5 S+ 0:00 grep mpirun
>> root_at_debian1:~# ompi-checkpoint 18357
>> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
>> root_at_debian1:~# ompi-checkpoint --term 18357
>> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
>> root_at_debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
>> --------------------------------------------------------------------------
>> Error: Unable to obtain the proper restart command to restart from the
>> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>>
>> --------------------------------------------------------------------------
>> [debian1:18384] *** Process received signal ***
>> [debian1:18384] Signal: Segmentation fault (11)
>> [debian1:18384] Signal code: Address not mapped (1)
>> [debian1:18384] Failing at address: 0x725f725f
>> [debian1:18384] [ 0] [0xb775f40c]
>> [debian1:18384] [ 1]
>> /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
>> [debian1:18384] [ 2]
>> /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
>> [debian1:18384] [ 3]
>> /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
>> [debian1:18384] [ 4] opal-restart [0x804908e]
>> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)
>> [0xb7568b55]
>> [debian1:18384] [ 6] opal-restart [0x8048fc1]
>> [debian1:18384] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 2 with PID 18384 on node debian1
>> exited on signal 11 (Segmentat
>> --------------------------------------------------------------------------
>>
>> I used a clean install of Debian Squeeze (testing) to make sure my
>> environment was ok. Those are the steps I took:
>>
>> - Installed Debian Squeeze, only base packages
>> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
>> tools, BLCR dev and run-time environment)
>> - Compiled openmpi-1.4.1
>>
>> Note that I compiled openmpi-1.4.1 myself because the Debian package
>> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
>> are no leftovers from any previous install of Debian packages
>> supplying OpenMPI, because this is a fresh install; no openmpi package
>> had been installed before.
>>
>> I used the following configure options:
>>
>> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
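For anyone reproducing this, the full build sequence implied by that configure line would be roughly the following (the install prefix is assumed to be the autoconf default of /usr/local, matching the poster's setup):

```shell
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make
make install   # installs under /usr/local by default
```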
>>
>> I also tried to add the option --with-memory-manager=none because I
>> saw an e-mail on the mailing list that described this as a possible
>> solution to an (apparently) unrelated problem, but the problem
>> remained the same.
>>
>> I don't have config.log (I rm'ed the build dir), but if you think it's
>> necessary I can recompile OpenMPI and provide it.
>>
>> Some information about the system (VirtualBox virtual machine, single
>> processor, btw):
>>
>> Kernel version 2.6.32-trunk-686
>>
>> root_at_debian1:~# lsmod | grep blcr
>> blcr 79084 0
>> blcr_imports 2077 1 blcr
>>
>> libcr (BLCR) is version 0.8.2-9.
>>
>> gcc is version 4.4.3.
>>
>>
>> Please let me know of any other information you might need.
>>
>>
>> Thanks in advance,
>>
>
> Hello,
>
> I figured it out. The problem is that the Debian package blcr-util,
> which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.),
> wasn't installed. I believe OpenMPI could perhaps show a more
> descriptive message instead of segfaulting, though? Also, you might
> want to add that information to the FAQ.
>
> Anyways, I'm filing another Debian bug report.
>
> For the sake of completeness, here's some more information:
>
> - I forgot to mention that I've installed OpenMPI to /usr/local,
> so I'm setting LD_LIBRARY_PATH to /usr/lib:/usr/local/lib in .bashrc,
> which lets me run any OpenMPI command without problems.
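For reference, the environment setup described above amounts to something like this in .bashrc (paths as given by the poster; the PATH line is my addition, assuming the Open MPI binaries landed in /usr/local/bin):

```shell
# Make the /usr/local Open MPI install usable from any shell
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/lib:/usr/local/lib
```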
>
> - I tested BLCR with cr_checkpoint and cr_restart with a simple app,
> and it worked great too.
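A standalone BLCR test along those lines (command names per BLCR's CLI; the program name and context file are illustrative) looks roughly like:

```shell
cr_run ./myapp &                # launch the app under BLCR's checkpointing library
APP_PID=$!
cr_checkpoint --term $APP_PID   # write context.<pid> in the cwd, then terminate the app
cr_restart context.$APP_PID     # resume the app from the saved context file
```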
>
> - I've purged /usr/local and rebuilt OpenMPI with the mentioned flags
> to obtain the attached config.log (gzipped).
>
> - With blcr-util installed, I can ompi-restart just fine. Without it
> installed, I get the segfault mentioned in my previous message.

Yes, ompi-restart should be printing a helpful message and exiting normally. Thanks for the bug report. I believe that I have seen and fixed this on a development branch making its way to the trunk. I'll make sure to move the fix to the 1.4 series once it has been applied to the trunk.
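A check of the kind described here (print a helpful message and exit instead of segfaulting) could be as simple as verifying the BLCR binaries are on the PATH before attempting the restart. The sketch below is a hypothetical illustration, not actual ompi-restart code; check_tool is an invented helper:

```shell
# Hypothetical pre-flight check before handing control to BLCR:
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found $1"
    else
        echo "error: $1 not found in PATH; is the blcr-util package installed?" >&2
        return 1
    fi
}

check_tool cr_restart || true   # on the poster's box this would have failed with a clear message
```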

I filed a ticket on this if you wanted to track the issue.
  https://svn.open-mpi.org/trac/ompi/ticket/2329

Thanks again,
Josh

>
>
>
> Best regards,
> <config.log.gz>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users