Subject: [OMPI users] OMPI error terminate w/o reasons
From: Jack Bryan (dtustudy68_at_[hidden])
Date: 2011-03-26 01:49:23

Hi all,
I am running an Open MPI (1.3.4) program with 200 parallel processes.
However, the program terminates with:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--------------------------------------------------------------------------
After searching, I found that signal 9 (SIGKILL) is described as follows:
"The process is currently in an unworkable state and should be terminated with extreme prejudice. If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away. The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler)."
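To make the point about signal handlers concrete, here is a minimal sketch, assuming a POSIX system. It is written in Python only for brevity and is not part of my MPI program; the function names are made up for illustration. It shows that the system refuses to install a handler for SIGKILL, and that a child killed this way reports only the signal number to its parent, which matches the bare "exited on signal 9" that mpirun prints.

```python
import signal
import subprocess
import sys


def try_install_handler(signum):
    """Return True if a handler can be installed for signum, else False."""
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except (OSError, ValueError):
        # The system rejects handlers for SIGKILL (and SIGSTOP).
        return False
    if previous is not None:
        # Put the earlier handler back so the process state is unchanged.
        signal.signal(signum, previous)
    return True


def kill_child_with_sigkill():
    """Start a sleeping child, send it SIGKILL, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)
    # On POSIX, a child ended by a signal has return code -signal_number.
    return child.wait()


if __name__ == "__main__":
    print("SIGKILL handler installable:", try_install_handler(signal.SIGKILL))
    print("child return code:", kill_child_with_sigkill())
```

Because the killed process never gets a chance to run any code of its own, it cannot print a reason; the parent sees only the signal number.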
However, the error message does not give any reason for the termination.
There is a for loop in main(). If the number of iterations is small (< 200), the program works well, but as the number grows larger, the program gets SIGKILL.
The cluster where I am running the MPI program does not allow running debugging tools.
If I run it on a workstation, it takes a very long time (with > 200 iterations) for the error to occur again.
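One check that needs no debugger is to log each process's peak memory as the loop runs. This rests on my own assumption, not on anything in the error message: if memory grows with every iteration, the kernel or the batch system may kill the process once it passes a limit. Below is a minimal sketch of the idea in Python (for brevity; the same counter is available in C through getrusage()). The names run_loop, peak_rss and leaky_work are hypothetical, and the deliberate leak only stands in for whatever the real loop body does.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_loop(n_iterations, work, report_every=10):
    """Call work(i) for each iteration, sampling peak RSS periodically."""
    samples = []
    for i in range(n_iterations):
        work(i)
        if i % report_every == 0:
            samples.append((i, peak_rss()))
    return samples


if __name__ == "__main__":
    retained = []  # stand-in for a bug that keeps allocations alive

    def leaky_work(i):
        # Keep 1 MiB of written memory per iteration (hypothetical leak).
        retained.append(b"x" * (1024 * 1024))

    for iteration, rss in run_loop(50, leaky_work):
        print(f"iteration {iteration}: peak RSS {rss}")
```

If the logged value keeps climbing with the iteration count, that would point toward a leak in the loop body; if it stays flat, this explanation is probably wrong.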
What can I do to find the possible bugs?
Any help is really appreciated.