Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] signal 15 (terminated)
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-02-04 08:53:56


On Feb 3, 2009, at 10:15 PM, Hana Milani wrote:

> sorry if I didn't answer:
>
> Have you checked to ensure that the job manager is not killing your
> job?
>
> I am not quite sure what you mean by job manager, but this is my own
> personal computer. Much to my surprise, I also have openSUSE on my
> laptop; I followed the same procedure there and the same message
> appeared!

Ok.

> Is there a local system administrator that you can talk to about this?
>
> Not a very good one, but I asked someone who had seen this message
> in his own work, and this was his answer:
>
> It means that the program corresponding to process identifier 2407
> (the PID in the second column of the "ps aux" output) running on one
> of your cluster's nodes (named linux-4pel) has stopped because it
> received the signal SIGTERM (termination signal 15). Sorry if this
> is a long explanation of things you already know :-). Let's say that
> you have a program running on your system; you can find its process
> ID number nnnnn by running "ps aux". Now if you want to stop it,
> e.g. because it is out of control, a convenient way is to send a
> termination request to the process by issuing
> "kill -s SIGTERM nnnnn". Here, openmpi notified you that one of the
> spawned processes was terminated because it received the SIGTERM
> signal and, as a consequence, stopped all the other distributed
> processes running on the nodes: once process 2407 had acknowledged
> SIGTERM, openmpi sent SIGTERM to all the processes associated with
> your parallel run.

This is exactly correct.
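
To make the mechanism above concrete, here is a minimal Python sketch (not part of the original exchange) of what "terminated by signal 15" looks like from the parent's side. The sleeping child is a hypothetical stand-in for one MPI process, and the function name `run_and_terminate` is made up for illustration.

```python
import signal
import subprocess
import sys
import time

def run_and_terminate():
    """Start a long-running child, send it SIGTERM, and return its exit status."""
    # Hypothetical stand-in for one MPI process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    time.sleep(0.5)                     # give the child a moment to start
    child.send_signal(signal.SIGTERM)   # same effect as: kill -s SIGTERM <pid>
    # On POSIX, a negative status -N means the child was killed by signal N.
    return child.wait()

if __name__ == "__main__":
    print(f"child exit status: {run_and_terminate()}")
```

A launcher that sees one child die this way can then decide to take down the rest of the job, which is the behaviour described in the quoted explanation.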

> Now ... how to avoid this? I am afraid there is no easy answer. The
> 2407 process has probably received a SIGTERM from another
> application; I mean it has not died by accident (a hanging or
> faulting process exits without invoking MPI_Finalize and produces a
> different error message). The difficulty is that you have to
> investigate which application issued the SIGTERM, i.e. which
> application told your 2407 process to terminate.

Also exactly correct.
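
One way to investigate who sent the signal is to ask the kernel for the sender's PID when the signal arrives. The sketch below is an illustration only, assuming Linux and Python 3.3 or later; the function name `wait_for_sigterm_sender` is hypothetical, and the process sends SIGTERM to itself purely so the example terminates. In a C or Fortran MPI application the equivalent would be a handler installed with `sigaction` and the `SA_SIGINFO` flag, reading `si_pid`.

```python
import os
import signal

def wait_for_sigterm_sender():
    """Block SIGTERM, wait for one to arrive, and return the sender's PID."""
    # Block SIGTERM so it stays pending instead of killing this process.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    # Demonstration only: send the signal to ourselves so the wait returns.
    os.kill(os.getpid(), signal.SIGTERM)
    # sigwaitinfo() returns a struct whose si_pid field is the sending PID.
    info = signal.sigwaitinfo({signal.SIGTERM})
    return info.si_pid

if __name__ == "__main__":
    sender = wait_for_sigterm_sender()
    print(f"SIGTERM came from PID {sender} (our own PID is {os.getpid()})")
```

With a PID in hand, "ps aux" can show which program it belongs to, provided that program is still running.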

> If you are working on a cluster where a resource manager (like
> SLURM, PBS or Torque) distributes the MPI processes to the nodes, I
> would check whether the manager is limiting the memory footprint or
> the CPU time of the jobs accepted by the linux-4pel computer.

This is what I was asking you; you're telling me that you have no
resource manager, and therefore this probably isn't the cause. But
*something* is killing your app with a SIGTERM.
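
Even without a resource manager, per-process limits may be set by the shell or the system. The following sketch (an illustration, assuming a Linux system; `current_limits` is a made-up helper name) prints the address-space and CPU-time limits that a process inherits. Note that exceeding these limits normally shows up as a failed allocation or as SIGXCPU/SIGKILL rather than SIGTERM, so this only helps rule that class of cause in or out.

```python
import resource

def current_limits():
    """Return the (soft, hard) limits for address space and CPU time."""
    return {
        "address space (bytes)": resource.getrlimit(resource.RLIMIT_AS),
        "CPU time (seconds)": resource.getrlimit(resource.RLIMIT_CPU),
    }

def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={describe(soft)}, hard={describe(hard)}")
```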

> It is tricky for me to figure out what could have asked your program
> to stop ... does it stop immediately or during a long run (CPU
> time?), with small jobs or large ones (memory?); is MPI running on
> a personal computer or a huge cluster (resource manager?); do you
> have sufficient privileges to look at /var/log/messages on
> linux-4pel?
>
> 1. The code stops running immediately. 2. The computers are my
> personal ones and no administrator has limited the 7.9 GiB of memory
> I have. 3. Run sequentially, the job takes 500-700 MiB of memory.

Is this a Fortran program, perchance?

Do you have access to the source code? I wonder if the program is
internally raising an error and effectively aborting itself. Do you
know that the application runs correctly? Do you have any test data
sets that you can try that give known outputs?

-- 
Jeff Squyres
Cisco Systems