Hi all,

I am running an Open MPI (1.3.4) program with 200 parallel processes.

However, the program is terminated with the following message:

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--------------------------------------------------------------------------

After searching, I found that signal 9 (SIGKILL) is described as follows:

the process is currently in an unworkable state and should be terminated with extreme prejudice


 If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.


 The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).

 

However, the error message does not indicate why the process was killed.


There is a for loop in main(). If the number of iterations is small (< 200), the program works well,

but as the number of iterations grows larger, the program gets SIGKILL.


The cluster where I am running the MPI program does not allow running debug tools. 


If I run it on a workstation, it takes a very long time (for > 200 iterations) to

reproduce the error.


What can I do to find the possible bugs?


Any help is really appreciated. 


Thanks,


Jack