Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing a restarted app fails
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-09-22 12:55:25

I believe this is now fixed in the trunk. I was able to reproduce with
the current trunk and committed a fix a few minutes ago in r19601. So
the fix should be in tonight's tarball (or you can grab it from SVN).
I've made a request to have the patch applied to v1.3, but that may
take a day or so to complete.

Let me know if this fix eliminates your SIGPIPE issues.
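
If the SIGPIPE still shows up after updating, a stack trace would still be
useful. Below is a rough sketch of the two routes mentioned in the quoted
text: a core file, or attaching a debugger while the process waits in the
sleep loop. The helper names core_dumps_enabled and stack_trace_of are made
up for illustration and are not part of Open MPI, and the second one assumes
gdb is installed on the node. Note that the "ulimit -a" output quoted below
reports a core file size of 0, and with that limit no core file is written.

```shell
# Hypothetical helpers; the function names are illustrative only.

# Succeeds if the current shell allows core files (soft limit is not 0).
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Attach gdb to a running process and print a backtrace of every thread.
# Returns 127 if gdb is not found in PATH.
stack_trace_of() {
    pid="$1"
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found in PATH" >&2
        return 127
    fi
    gdb -p "$pid" -batch -ex "thread apply all bt"
}
```

With opal_cr_debug_sigpipe=1 set, stack_trace_of could be called with the PID
printed on the "opal_cr: sigpipe_debug" line. The core file limit only applies
to the shell in which ulimit was run and to the processes it starts, so it may
need to be raised in the same shell that launches mpirun, and raising it can
fail if the hard limit is lower.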

Thanks for the bug report :)


On Sep 17, 2008, at 11:55 PM, Matthias Hovestadt wrote:

> Hi Josh!
> First of all, thanks a lot for replying. :-)
>>> When executing this checkpoint command, the running application
>>> directly aborts, even though I did not specify the "--term" option:
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 14050 on node grid-
>>> exited on signal 13 (Broken pipe).
>>> --------------------------------------------------------------------------
>>> ccs_at_grid-demo-1:~$
>> Interesting. This looks like a bug with the restart mechanism in
>> Open MPI. This was working fine, but something must have changed in
>> the trunk to break it.
> Do you perhaps know an SVN revision number of OMPI that
> is known to be working? If this issue is a regression
> failure, I would be glad to use the source from an old
> but working SVN state...
>> A useful piece of debugging information for me would be a stack
>> trace from the failed process. You should be able to get this from
>> a core file it left, or you could set the following MCA variable
>> in $HOME/.openmpi/mca-params.conf:
>> opal_cr_debug_sigpipe=1
>> This will cause the Open MPI app to wait in a sleep loop when it
>> detects a Broken Pipe signal. Then you should be able to attach a
>> debugger and retrieve a stack trace.
> I created this file:
> ccs_at_grid-demo-1:~$ cat .openmpi/mca-params.conf
> opal_cr_debug_sigpipe=1
> ccs_at_grid-demo-1:~$
> Then I restarted the application from a checkpointed state
> and tried to checkpoint this restarted application. Unfortunately
> the restarted application still terminates, despite this
> parameter. However, the output changed slightly:
> worker fetch area available 1
> [] opal_cr: sigpipe_debug: Debug
> SIGPIPE [13]: PID (26220)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 26248 on node grid-
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 2 total processes killed (some possibly by mpirun during cleanup)
> ccs_at_grid-demo-1:~$
> There is now this additional "opal_cr: sigpipe_debug" line, so
> it apparently evaluates the .openmpi/mca-params.conf file.
> I also tried to get a corefile by setting "ulimit -c 50000", so
> that ulimit -a gives me the following output:
> ccs_at_grid-demo-1:~$ ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 20
> file size (blocks, -f) unlimited
> pending signals (-i) unlimited
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) unlimited
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) unlimited
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
> ccs_at_grid-demo-1:~$
> Unfortunately, no core file is generated, so I do not know
> how to give you the requested stack trace.
> Are there perhaps other debug parameters I could use?
> Best,
> Matthias