Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Checkpointing a restarted app fails
From: Matthias Hovestadt (maho_at_[hidden])
Date: 2008-09-17 23:55:44


Hi Josh!

First of all, thanks a lot for replying. :-)

>> When executing this checkpoint command, the running application
>> directly aborts, even though I did not specify the "--term" option:
>>
>> --------------------------------------------------------------------------
>>
>> mpirun noticed that process rank 1 with PID 14050 on node
>> grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>> --------------------------------------------------------------------------
>>
>> ccs_at_grid-demo-1:~$
>
> Interesting. This looks like a bug with the restart mechanism in Open
> MPI. This was working fine, but something must have changed in the trunk
> to break it.

Do you perhaps know an SVN revision number of OMPI that
is known to work? If this issue is a regression, I would
be glad to use the source from an older but working SVN
state...

> A useful piece of debugging information for me would be a stack trace
> from the failed process. You should be able to get this from a core file
> it left or If you would set the following MCA variable in
> $HOME/.openmpi/mca-params.conf:
> opal_cr_debug_sigpipe=1
> This will cause the Open MPI app to wait in a sleep loop when it detects
> a Broken Pipe signal. Then you should be able to attach a debugger and
> retrieve a stack trace.

I created this file:

ccs_at_grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs_at_grid-demo-1:~$
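
For reference, the same setting can be scripted. The following is a
minimal sketch, not part of Open MPI: set_mca_param is a hypothetical
helper name, and the only assumptions are the file location quoted
above ($HOME/.openmpi/mca-params.conf) and its name=value line format.

```shell
# Hypothetical helper (not an Open MPI tool): append "name=value" to the
# per-user MCA parameter file unless a line for that name already exists.
set_mca_param() {
    name=$1
    value=$2
    conf="${HOME}/.openmpi/mca-params.conf"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Only append when no line starting with "name=" is present yet.
    if ! grep -q "^${name}[[:space:]]*=" "$conf"; then
        printf '%s=%s\n' "$name" "$value" >> "$conf"
    fi
}

# Usage:
#   set_mca_param opal_cr_debug_sigpipe 1
# For a single shell session, Open MPI also reads MCA parameters from
# environment variables with the OMPI_MCA_ prefix:
#   export OMPI_MCA_opal_cr_debug_sigpipe=1
```

Calling the helper twice leaves a single line in the file, so it is
safe to run from a login script.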

Then I restarted the application from a checkpointed state
and tried to checkpoint this restarted application. Unfortunately,
the restarted application still terminates despite this parameter.
However, the output changed slightly:

worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug SIGPIPE [13]: PID (26220)
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 26248 on node
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
ccs_at_grid-demo-1:~$

There is now this additional "opal_cr: sigpipe_debug" line, so
Open MPI apparently does read .openmpi/mca-params.conf.
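
Since the debug line prints the PID of the waiting process, attaching
a debugger could be scripted roughly as follows. This is only a
sketch: sigpipe_pid and dump_stack are hypothetical helper names, the
message is assumed to arrive on one line in the format shown above,
and it is not clear from the output whether the process stays alive
long enough before mpirun cleans up.

```shell
# Hypothetical helper: read output on stdin and print the PID from an
# "opal_cr: sigpipe_debug" line (format assumed from the output above).
sigpipe_pid() {
    sed -n 's/.*sigpipe_debug:.*PID (\([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: attach gdb to a PID, print a backtrace of every
# thread, and detach again.
dump_stack() {
    gdb -p "$1" -batch -ex "thread apply all bt"
}

# Usage (app.log is a placeholder for wherever the output was captured):
#   pid=$(sigpipe_pid < app.log)
#   dump_stack "$pid"
```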

I also tried to get a core file by setting "ulimit -c 50000";
ulimit -a gives me the following output:

ccs_at_grid-demo-1:~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
ccs_at_grid-demo-1:~$

Unfortunately, no core file is generated, so I do not know
how to give you the requested stack trace.
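
Note that the ulimit -a output above reports a core file size of 0,
which by itself would prevent a core file from being written in that
shell. A small check like the following might help verify the limit
right before launching the job. It is a sketch with hypothetical
helper names (core_dumps_enabled, enable_core_dumps), and it only
covers the launching shell: processes started on other nodes or by a
daemon may inherit different limits.

```shell
# Hypothetical helper: succeed if the current shell's soft limit allows
# core files at all (any value other than 0).
core_dumps_enabled() {
    [ "$(ulimit -c)" != "0" ]
}

# Hypothetical helper: raise the soft core limit as far as permitted.
# "unlimited" fails when the hard limit is lower, so fall back to the
# hard limit itself.
enable_core_dumps() {
    ulimit -c unlimited 2>/dev/null || ulimit -c "$(ulimit -H -c)"
}

# Usage, in the same shell that starts the application:
#   enable_core_dumps
#   core_dumps_enabled || echo "core files are still disabled" >&2
```

If the limit is nonzero and still no core file appears, the kernel's
core file location (on Linux, /proc/sys/kernel/core_pattern) may be
worth checking as well.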

Are there perhaps other debug parameters I could use?

Best,
Matthias