Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing a restarted app fails
From: Josh Hursey (jjhursey_at_[hidden])
Date: 2008-09-22 12:55:25

I believe this is now fixed in the trunk. I was able to reproduce with
the current trunk and committed a fix a few minutes ago in r19601. So
the fix should be in tonight's tarball (or you can grab it from SVN).
I've made a request to have the patch applied to v1.3, but that may
take a day or so to complete.
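
For reference, grabbing the fix straight from SVN might look like the
following; the trunk URL and the checkpoint/restart configure flags are
the usual ones for this era of the code, but treat them as assumptions
for your setup:

  # check out the trunk at (or after) the revision with the fix
  svn co -r 19601 http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
  cd ompi-trunk
  # SVN checkouts need autogen before configure
  ./autogen.sh
  # --with-blcr path is hypothetical; point it at your BLCR install
  ./configure --with-ft=cr --enable-ft-thread --with-blcr=/usr/local
  make all install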

Let me know if this fix eliminates your SIGPIPE issues.

Thanks for the bug report :)


On Sep 17, 2008, at 11:55 PM, Matthias Hovestadt wrote:

> Hi Josh!
> First of all, thanks a lot for replying. :-)
>>> When executing this checkpoint command, the running application
>>> directly aborts, even though I did not specify the "--term" option:
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 14050 on node grid-
>>> exited on signal 13 (Broken pipe).
>>> --------------------------------------------------------------------------
>>> ccs_at_grid-demo-1:~$
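
(The checkpoint command referenced above is presumably ompi-checkpoint
run against the PID of mpirun; a typical invocation, with a hypothetical
PID and snapshot name, would be:

  # checkpoint the job; without --term the job should keep running
  ompi-checkpoint -v 14049
  # later, restart from the snapshot it produced
  ompi-restart ompi_global_snapshot_14049.ckpt
)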
>> Interesting. This looks like a bug with the restart mechanism in
>> Open MPI. This was working fine, but something must have changed in
>> the trunk to break it.
> Do you perhaps know an SVN revision number of OMPI that
> is known to be working? If this issue is a regression,
> I would be glad to use the source from an older
> but working SVN state...
>> A useful piece of debugging information for me would be a stack
>> trace from the failed process. You should be able to get this from
>> a core file it left, or you can set the following MCA variable
>> in $HOME/.openmpi/mca-params.conf:
>> opal_cr_debug_sigpipe=1
>> This will cause the Open MPI app to wait in a sleep loop when it
>> detects a Broken Pipe signal. Then you should be able to attach a
>> debugger and retrieve a stack trace.
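
Once the process is parked in that sleep loop, attaching with gdb and
pulling a backtrace is the standard procedure; with a hypothetical PID:

  gdb -p 26220      # attach to the waiting process
  (gdb) bt          # capture the stack trace to send along
  (gdb) detach
  (gdb) quit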
> I created this file:
> ccs_at_grid-demo-1:~$ cat .openmpi/mca-params.conf
> opal_cr_debug_sigpipe=1
> ccs_at_grid-demo-1:~$
> Then I restarted the application from a checkpointed state
> and tried to checkpoint this restarted application. Unfortunately
> the restarted application still terminates, despite this
> parameter. However, the output changed slightly:
> worker fetch area available 1
> [] opal_cr: sigpipe_debug: Debug
> SIGPIPE [13]: PID (26220)
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 26248 on node grid-
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 2 total processes killed (some possibly by mpirun during cleanup)
> ccs_at_grid-demo-1:~$
> There is now this additional "opal_cr: sigpipe_debug" line, so
> it apparently evaluates the .openmpi/mca-params.conf file.
> I also tried to get a core file by setting "ulimit -c 50000";
> ulimit -a then gives me the following output:
> ccs_at_grid-demo-1:~$ ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 20
> file size (blocks, -f) unlimited
> pending signals (-i) unlimited
> max locked memory (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files (-n) 1024
> pipe size (512 bytes, -p) 8
> POSIX message queues (bytes, -q) unlimited
> real-time priority (-r) 0
> stack size (kbytes, -s) 8192
> cpu time (seconds, -t) unlimited
> max user processes (-u) unlimited
> virtual memory (kbytes, -v) unlimited
> file locks (-x) unlimited
> ccs_at_grid-demo-1:~$
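
Note that the listing above still reports "core file size (blocks, -c) 0",
which would explain why no core file appears: the limit is per shell and
inherited by child processes, so it has to be raised in the very shell
that launches the job, before mpirun starts. A minimal sketch, with a
hypothetical application name:

  ulimit -c unlimited   # raise the core limit in this shell
  ulimit -c             # verify; should no longer print 0
  mpirun -np 2 ./app    # the launched processes inherit the limit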
> Unfortunately, no core file is generated, so I do not know
> how to give you the requested stack trace.
> Are there perhaps other debug parameters I could use?
> Best,
> Matthias