Subject: Re: [OMPI users] ompi-checkpoint problem on shared storage
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2011-09-23 15:24:43


It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.

I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any variation? You might also try the trunk, but I have not tested it
recently enough to know if things are still working correctly or not
(have others?).

I'll file a ticket so we do not lose track of the bug. Hopefully we
will get to it soon.
  https://svn.open-mpi.org/trac/ompi/ticket/2872

Thanks,
Josh

On Fri, Sep 23, 2011 at 3:08 PM, Dave Schulz <dschulz_at_[hidden]> wrote:
> Hi Everyone.
>
> I've been trying to figure out an issue with ompi-checkpoint/BLCR.  The
> symptoms seem to depend on which filesystem the
> snapc_base_global_snapshot_dir directory is located on.
>
> I wrote a simple MPI program where rank 0 sends to rank 1, 1 sends to 2, and
> so on, the highest rank sends back to 0, then every rank waits 1 second and repeats.
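
A minimal ring program along those lines (a reconstruction for illustration; the actual mpiloop source may differ) could look like:

/* Token ring, assuming at least 2 ranks: each rank passes an integer to the
 * next rank, the highest rank passes it back to rank 0, then everyone sleeps
 * for a second and repeats until the job is checkpointed or killed. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    while (1) {
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        if (rank == 0) {
            /* Rank 0 starts the ring, then waits for the token to come back. */
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
        sleep(1);
    }

    MPI_Finalize();  /* not reached; the loop runs until the job is terminated */
    return 0;
}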
>
> I'm using openmpi-1.4.3, and when I run "ompi-checkpoint --term
> <pidofmpirun>" with the checkpoint directory on the shared filesystems,
> ompi-checkpoint returns a checkpoint reference and the worker processes go
> away, but the mpirun process remains and is stuck (it dies right away if I
> run kill on it -- so it's responding to SIGTERM).  If I attach strace to
> that mpirun, it loops on the following forever:
>
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
> {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6,
> 1000) = 0 (Timeout)
>
> I'm running with:
> mpirun -machinefile machines -am ft-enable-cr ./mpiloop
> The "machines" file simply has the local hostname listed a few times; I've
> tried 2 and 8 processes, and can try up to 24 if it's deemed useful, since
> this is a fairly big node: 4-socket, 6-core Opterons with 256 GB of RAM.
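
For reference, the sequence described in this thread amounts to something like the following (the hostname in the comment and the snapshot directory path are placeholders; snapc_base_global_snapshot_dir is the MCA parameter named above):

# "machines" holds the local hostname repeated once per desired process, e.g.:
#   node01
#   node01

# launch with checkpoint/restart support; the directory below stands in for
# wherever snapc_base_global_snapshot_dir points (the shared Ibrix mount in
# the failing case, local /tmp in the working one)
mpirun -machinefile machines -am ft-enable-cr \
       -mca snapc_base_global_snapshot_dir /shared/checkpoints \
       ./mpiloop

# from another shell, checkpoint the job and terminate it
ompi-checkpoint --term <pid of mpirun>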
>
>
> I initially installed this on a test system with only local hard disks and
> standard NFS on CentOS 5.6, where everything worked as expected.  When I
> moved over to the production system, things started breaking.  The filesystem
> is the major software difference: the shared filesystems are Ibrix, and that
> is where the above symptoms started to appear.
>
> I haven't even moved on to multi-node MPI runs, as I can't get this to work
> for any number of processes on the local machine unless I set the checkpoint
> directory to /tmp, which is on a local XFS hard disk.  If I put the
> checkpoints on any shared directory, things fail.
>
> I've tried a number of *_verbose MCA parameters, and none of them seem to
> issue any messages at the point of checkpoint; only when I give up and run
> kill `pidof mpirun` are there any further messages.
>
> Open MPI is compiled with:
> ./configure --prefix=/global/software/openmpi-blcr \
>     --with-blcr=/global/software/blcr \
>     --with-blcr-libdir=/global/software/blcr/lib/ --with-ft=cr \
>     --enable-ft-thread --enable-mpi-threads --with-openib --with-tm
>
> and BLCR is configured with only a prefix to put it in /global/software/blcr;
> otherwise it's vanilla.  Both are compiled with the default gcc.
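
A "prefix only" BLCR build as described presumably amounts to no more than the following (assumed invocation, not copied from the actual build log):

./configure --prefix=/global/software/blcr
make
make install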
>
> One final note, is that occasionally it does succeed and terminate.  But it
> seems completely random.
>
> What I'm wondering is has anyone else seen symptoms like this -- especially
> where the mpirun doesn't quit after a checkpoint with --term but the worker
> processes do?
>
> Also, is there some sort of somewhat unusual filesystem semantic that our
> shared filesystem may not support that ompi/ompi-checkpoint is needing?
>
> Thanks for any insights you may have.
>
> -Dave
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey