Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Question on suspending/resuming MPI processes with SIGSTOP
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-04-11 15:51:06


I'm afraid our suspend/resume support only allows the signal to be applied to *all* procs, not selectively to some. For that matter, I'm not aware of any MPI-level API for sending a signal to a proc, so I'm not sure how you would programmatically have rank 0 suspend some of the other ranks.
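In Frank's single-host scenario the signal would not need to go through MPI at all: rank 0 could learn the other ranks' PIDs (for example by gathering each rank's getpid() with MPI_Gather) and signal them with the POSIX kill() call. The sketch below shows only the POSIX part. The helper names (suspend_pids, resume_pids, demo_stop_continue) are hypothetical, and a forked child that busy-loops stands in for a rank spinning inside MPI_Barrier. It assumes a POSIX system and that every target process runs on the same host as the sender.

```c
#define _XOPEN_SOURCE 700 /* expose kill(), WCONTINUED, WIFCONTINUED */

#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Send sig to each PID in turn; return 0 on success, -1 on the first failure (errno set by kill). */
static int signal_pids(const pid_t *pids, size_t n, int sig)
{
    for (size_t i = 0; i < n; i++) {
        if (kill(pids[i], sig) != 0)
            return -1;
    }
    return 0;
}

/* Hypothetical helper: SIGSTOP cannot be caught or ignored, so the targets stop unconditionally. */
int suspend_pids(const pid_t *pids, size_t n) { return signal_pids(pids, n, SIGSTOP); }

/* Hypothetical helper: resume previously stopped processes. */
int resume_pids(const pid_t *pids, size_t n) { return signal_pids(pids, n, SIGCONT); }

/* Demo: fork a busy-looping child (a stand-in for a rank polling in MPI_Barrier),
 * stop it, confirm it is stopped, resume it, confirm it continued, then clean up.
 * Returns 0 if every step behaved as expected, -1 otherwise. */
int demo_stop_continue(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        volatile unsigned long spins = 0;
        for (;;)
            spins++; /* burns CPU until the parent stops or kills it */
    }

    int status = 0;
    int ok = suspend_pids(&child, 1) == 0
          && waitpid(child, &status, WUNTRACED) == child
          && WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP  /* not scheduled now */
          && resume_pids(&child, 1) == 0
          && waitpid(child, &status, WCONTINUED) == child
          && WIFCONTINUED(status);                             /* running again */

    kill(child, SIGKILL);       /* the loop never ends on its own */
    waitpid(child, &status, 0); /* reap the child */
    return ok ? 0 : -1;
}
```

In an MPI program, rank 0 would pass the gathered PIDs of the ranks to park to suspend_pids and call resume_pids before any other rank needs them to make progress. Two caveats remain that the sketch does not address: rank 0 cannot directly observe that the target ranks have actually entered MPI_Barrier, so the stop may land earlier, and a stopped rank cannot advance any communication its peers are waiting on until it receives SIGCONT.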

On Apr 11, 2014, at 12:16 PM, Frank Wein <mcsmurf_at_[hidden]> wrote:

> Hi,
> I've got a question on suspending/resuming a process started with
> "mpirun". I've already found the FAQ entry on this
> (http://www.open-mpi.de/faq/?category=running#suspend-resume), but I
> still have a question. For now, let's assume I'm running all MPI
> processes on a single host with one multi-core CPU (so I could send
> SIGSTOP directly to other processes if I wanted to). Here is what I'm
> wondering: I want to start multiple (let's say four) instances of my
> program with "mpirun -np 4 ./mybinary", and at some point during
> execution I want to suspend two of those four processes while they are
> waiting at an MPI_Barrier(). The goal is for those processes to use no
> CPU at all while they are suspended (which, as far as I understand, is
> not the case while they wait in MPI_Barrier). So my question is: if my
> MPI rank 0 process sends SIGSTOP to these two processes while they are
> waiting at an MPI_Barrier, will that work, and will those two processes
> stop using the CPU? Later, once the other two processes have also
> arrived at this MPI_Barrier, I want to resume them with SIGCONT.
> Performance of the barrier does not matter here; what matters is that
> the suspended processes don't cause any CPU usage. I have never used
> SIGSTOP before, so I'm not sure whether this will work, and before I
> start coding the logic into my program I thought I'd ask here first
> whether it will work at all :).
>
> Frank
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users