
Open MPI User's Mailing List Archives


Subject: [OMPI users] ompi-checkpoint problem on shared storage
From: Dave Schulz (dschulz_at_[hidden])
Date: 2011-09-23 15:08:57


Hi Everyone.

I've been trying to figure out an issue with ompi-checkpoint/BLCR. The
symptoms seem to depend on which filesystem the
snapc_base_global_snapshot_dir is located on.

I wrote a simple MPI program in which rank 0 sends to rank 1, rank 1 to
rank 2, and so on, and the highest rank sends back to rank 0. Then it
waits one second and repeats.
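
For reference, the program ("mpiloop") is roughly equivalent to the sketch
below; this is a reconstruction of what it does, not the exact source:

/* mpiloop.c -- sketch: pass a token around a ring, sleep 1s, repeat */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (;;) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        if (rank == 0) {
            /* rank 0 starts the ring, then receives from the highest rank */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            /* everyone else forwards the token from the previous rank to the next */
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }

        sleep(1);   /* wait one second and repeat */
    }

    MPI_Finalize();  /* never reached; the job runs until killed or checkpointed */
    return 0;
}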

I'm using openmpi-1.4.3. When I run "ompi-checkpoint --term
<pidofmpirun>" with the snapshot directory on the shared filesystem,
ompi-checkpoint returns a checkpoint reference and the worker processes
go away, but the mpirun process remains and is stuck (it dies right away
if I run kill on it, so it is still responding to SIGTERM). If I attach
strace to the stuck mpirun, it shows the following forever:

poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)

I'm running with:
mpirun -machinefile machines -am ft-enable-cr ./mpiloop
The "machines" file simply has the local hostname listed a few times;
I've tried 2 and 8 entries, and I can go up to 24 if that would be
useful, since this is a fairly big node: 4-socket, 6-core Opteron with
256 GB of RAM.
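
For two processes, for example, the machines file is just the local
hostname listed twice (the hostname below is a placeholder):

node01
node01

and for 8 processes the same name is repeated 8 times.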

I initially installed this on a test system with only local hard disks
and standard NFS on CentOS 5.6, where everything worked as expected.
When I moved over to the production system, things started breaking. The
filesystem is the major software difference: the shared filesystems
there are Ibrix, and that is where the above symptoms appeared.

I haven't even moved on to multi-node MPI runs, because I can't get this
to work for any number of processes on the local machine unless I set
the checkpoint directory to /tmp, which is on a local XFS hard disk. If
I put the checkpoints on any shared directory, things fail.
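
Concretely, the failing case is launched roughly like this (the snapshot
path below is just an example of a directory on the shared filesystem,
not our literal path):

mpirun -machinefile machines -am ft-enable-cr \
    -mca snapc_base_global_snapshot_dir /global/scratch/checkpoints \
    ./mpiloop

and then, from another shell:

ompi-checkpoint --term <pidofmpirun>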

I've tried a number of *_verbose MCA parameters, and none of them seem
to issue any messages at the point of the checkpoint; the only further
messages appear when I give up and send kill `pidof mpirun`.
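
For example, runs along these lines (verbose parameter names pulled from
ompi_info; I may be misremembering the exact set I used):

mpirun -machinefile machines -am ft-enable-cr \
    -mca snapc_base_verbose 20 \
    -mca crs_base_verbose 20 \
    ./mpiloop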

openmpi is compiled with:
./configure --prefix=/global/software/openmpi-blcr \
    --with-blcr=/global/software/blcr \
    --with-blcr-libdir=/global/software/blcr/lib/ \
    --with-ft=cr --enable-ft-thread --enable-mpi-threads \
    --with-openib --with-tm

BLCR only has a prefix to put it in /global/software/blcr; otherwise
it's vanilla. Both are compiled with the default gcc.
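
In other words, the BLCR build is nothing more than this (the install
step is the usual one, shown for completeness):

./configure --prefix=/global/software/blcr
make && make install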

One final note: occasionally the checkpoint does succeed and mpirun
terminates, but it seems completely random.

What I'm wondering is: has anyone else seen symptoms like this,
especially the case where mpirun doesn't quit after a checkpoint with
--term even though the worker processes do?

Also, is there some somewhat unusual filesystem semantic that Open MPI
and ompi-checkpoint need but that our shared filesystem may not support?

Thanks for any insights you may have.

-Dave