Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Proper use of sigaction in Open MPI?
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-04-24 11:56:56


I have never tested this before, so I could be wrong. However, my best guess
is that the following is happening:

1. you trap the signal and do your cleanup. However, when your proc now
exits, it does not exit with a status of "terminated-by-signal". Instead, it
exits normally.

2. the local daemon sees the proc exit, but since it exit'd normally, it
takes no action to abort the job. Hence, mpirun has no idea that anything
"wrong" has happened, nor that it should do anything about it.

3. if you re-raise the signal, the proc now exits with
"terminated-by-signal", so the abort procedure works as intended.

Since you call mpi_finalize before leaving, even the upcoming 1.3 release
would be "fooled" by this behavior. It will again think that the proc exit'd
normally, and happily wait for all the procs to "complete".
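
The difference between steps 1 and 3 above can be reproduced without MPI. The following is a minimal sketch, not Open MPI code: the parent process stands in for the local daemon and simply inspects the wait status, which is an assumption about how such a status would be observed, not a description of the daemon's implementation. The names classify_exit, trap_and_exit, and trap_and_reraise are hypothetical.

```cpp
#include <csignal>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Disposition that was in place before our handler was installed.
static struct sigaction g_old_term;

// Variant A: trap the signal, clean up, exit normally.
extern "C" void trap_and_exit(int) {
    // (cleanup such as unlink() of partial files would go here)
    _exit(0);
}

// Variant B: trap the signal, clean up, restore the old disposition, re-raise.
extern "C" void trap_and_reraise(int sig) {
    // (cleanup would go here)
    sigaction(sig, &g_old_term, nullptr);
    raise(sig);  // stays pending until the handler returns, then terminates us
}

// Fork a child that sends itself SIGTERM and report how the parent saw it end.
std::string classify_exit(bool reraise) {
    pid_t pid = fork();
    if (pid < 0) return "fork failed";
    if (pid == 0) {
        signal(SIGTERM, SIG_DFL);  // start from the default disposition
        struct sigaction sa{};
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sa.sa_handler = reraise ? trap_and_reraise : trap_and_exit;
        sigaction(SIGTERM, &sa, &g_old_term);
        raise(SIGTERM);
        _exit(3);  // fallback; not expected to be reached in either variant
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return "waitpid failed";
    if (WIFSIGNALED(status)) return "terminated by signal";
    if (WIFEXITED(status)) return "exited normally";
    return "other";
}
```

Calling classify_exit(false) models the trap-and-exit case, and classify_exit(true) models the reset-and-re-raise case. Only the second leaves a wait status that a parent can recognize as death by signal.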

Now, if -all- of your procs receive this signal and terminate, then the
system should shut down. But I gather from your note that this isn't the case,
and that only a subset, perhaps only one, of the procs is taking this action?

If all of the procs are exiting, then it is possible that there is a bug in
the 1.2 release that is getting confused by the signals. Mpirun does trap
SIGTERM to order a clean abort of all procs, so it is possible that a race
condition is getting activated and causing mpirun to hang. Unfortunately,
that can happen in the 1.2 series. The 1.3 release should be more robust in
that regard.

I don't think what you are doing will cause any horrid problems. Like I
said, I have never tried something like this, so I might be surprised.

But if your job cleans up the way you want, I certainly wouldn't worry about
it. At the worst, there might be some dangling tmp files from Open MPI.

Ralph

On 4/24/08 8:51 AM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:

> Thoughts?
>
> Is this a "fixed in 1.3" issue?
>
> -jms
> Sent from my PDA. No type good.
>
> -----Original Message-----
> From: Keller, Jesse [mailto:jesse.keller_at_[hidden]]
> Sent: Thursday, April 24, 2008 09:35 AM Eastern Standard Time
> To: users_at_[hidden]
> Subject: [OMPI users] Proper use of sigaction in Open MPI?
>
> Hello, all -
>
>
>
> I have an OpenMPI application that generates a file while it runs. No big
> deal. However, I'd like to delete the partial file if the job is aborted via
> a user signal. In a non-MPI application, I'd use sigaction to intercept the
> SIGTERM and delete the open files there. I'd then call the "old" signal
> handler. When I tried this with my OpenMPI program, the signal was caught,
> the files deleted, the processes exited, but the MPI exec command as a whole
> did not exit. This is the technique, by the way, that was described in this
> IBM MPI document:
>
>
>
> http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.pe.doc/pe_linux42/am106l0037.html
>
>
>
> My question is, what is the "right" way to do this under OpenMPI? The only
> way I got the thing to work was by resetting the sigaction to the old handler
> and re-raising the signal. It seems to work, but I want to know if I am going
> to get "bit" by this. Specifically, am I "closing" MPI correctly by doing
> this?
>
>
>
> I am running OpenMPI 1.2.5 under Fedora 8 on Linux in a x86_64 environment.
> My compiler is gcc 4.1.2. This behavior happens when all processes are
> running on the same node using shared memory and between nodes when using TCP
> transport. I don't have access to any other transport.
>
>
>
> Thanks for your help.
>
>
>
> Jesse Keller
>
> 454 Life Sciences
>
>
>
> Here's a code snippet to demonstrate what I'm talking about.
>
>
>
> ----------------------------------------------------------------------------------------------------
>
>
>
> struct sigaction sa_old_term; /* Global: previous SIGTERM disposition. */
>
> void
> SIGTERM_handler(int signal, siginfo_t * siginfo, void * a)
> {
>     UnlinkOpenedFiles(); /* Global function to delete partial files. */
>
>     /* The commented code doesn't work. */
>     //if (sa_old_term.sa_sigaction)
>     //{
>     //    sa_old_term.sa_flags = SA_SIGINFO;
>     //    (*sa_old_term.sa_sigaction)(signal, siginfo, a);
>     //}
>
>     /* Restore the old handler and re-raise the signal. */
>     sigaction(SIGTERM, &sa_old_term, NULL);
>     raise(signal);
> }
>
> int main(int argc, char ** argv)
> {
>     MPI::Init(argc, argv);
>
>     struct sigaction sa_term;
>     sigemptyset(&sa_term.sa_mask);
>     sa_term.sa_flags = SA_SIGINFO;
>     sa_term.sa_sigaction = SIGTERM_handler;
>     sigaction(SIGTERM, &sa_term, &sa_old_term);
>
>     doSomeMPIComputation();
>     MPI::Finalize();
>     return 0;
> }