Hello,
I'm trying to come up with a fault-tolerant OpenMPI setup for research
purposes. I'm doing some tests now, but I'm stuck with a segfault when
I try to restart my test program from a checkpoint.
My test program is the "ring" program, where messages are sent to the
next node in the ring N times. It's pretty simple, I can supply the
source code if needed. I'm running it like this:
# mpirun -np 4 -am ft-enable-cr ring
...
>>> Process 1 sending 703 to 2
>>> Process 3 received 704
>>> Process 3 sending 704 to 0
>>> Process 3 received 703
>>> Process 3 sending 703 to 0
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 18358 on node debian1
exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
That's the output when I ompi-checkpoint the mpirun PID from another terminal.
The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
checkpoint directory has been created in $HOME.
This is what I get when I try to run ompi-restart:
root_at_debian1:~# ps ax | grep mpirun
18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
18378 pts/5 S+ 0:00 grep mpirun
root_at_debian1:~# ompi-checkpoint 18357
Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
root_at_debian1:~# ompi-checkpoint --term 18357
Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
root_at_debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
checkpoint file (opal_snapshot_2.ckpt). Returned -1.
--------------------------------------------------------------------------
[debian1:18384] *** Process received signal ***
[debian1:18384] Signal: Segmentation fault (11)
[debian1:18384] Signal code: Address not mapped (1)
[debian1:18384] Failing at address: 0x725f725f
[debian1:18384] [ 0] [0xb775f40c]
[debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
[debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
[debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
[debian1:18384] [ 4] opal-restart [0x804908e]
[debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
[debian1:18384] [ 6] opal-restart [0x8048fc1]
[debian1:18384] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 18384 on node debian1
exited on signal 11 (Segmentat
--------------------------------------------------------------------------
I used a clean install of Debian Squeeze (testing) to make sure my
environment was ok. These are the steps I took:
- Installed Debian Squeeze, only base packages
- Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
tools, BLCR dev and run-time environment)
- Compiled openmpi-1.4.1
I compiled openmpi-1.4.1 from source because the Debian package
(openmpi-checkpoint) doesn't seem to be usable at the moment. This is
a fresh install, so there are no leftovers from any previously
installed Debian OpenMPI package.
I used the following configure options:
# ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
I also tried adding the option --with-memory-manager=none because I
saw an e-mail on the mailing list that described it as a possible
solution to an (apparently) unrelated problem, but it made no
difference.
I don't have config.log (I rm'ed the build dir), but if you think it's
necessary I can recompile OpenMPI and provide it.
Some information about the system (VirtualBox virtual machine, single
processor, btw):
Kernel version 2.6.32-trunk-686
root_at_debian1:~# lsmod | grep blcr
blcr 79084 0
blcr_imports 2077 1 blcr
libcr (BLCR) is version 0.8.2-9.
gcc is version 4.4.3.
Please let me know of any other information you might need.
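For readers unfamiliar with the ring test, its message pattern can be sketched as a single-process simulation. This is an illustration only, not the actual MPI source: the function name `simulate_ring`, the starting value, and the rule that the value drops by one after each full lap are assumptions chosen to match the output format shown earlier.

```python
def simulate_ring(num_procs, start_value):
    """Simulate a value circulating around a ring of num_procs ranks.

    Hypothetical model (not the real MPI program): each rank sends the
    current value to (rank + 1) % num_procs, and the value is decremented
    once per full lap until it reaches 0.
    """
    lines = []
    value = start_value
    while value > 0:
        # One full lap: every rank forwards the value to its neighbour.
        for rank in range(num_procs):
            next_rank = (rank + 1) % num_procs
            lines.append(f">>> Process {rank} sending {value} to {next_rank}")
            lines.append(f">>> Process {next_rank} received {value}")
        # Assumed countdown rule: the value drops by one after each lap.
        value -= 1
    return lines


if __name__ == "__main__":
    # Small run: 4 processes, 2 laps.
    for line in simulate_ring(4, 2):
        print(line)
```

In the real program the sends and receives run in separate processes, so their output interleaves rather than appearing in strict lap order as it does here.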
Thanks in advance,