Try adding some print statements so you can see where the error occurs.
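
As a rough sketch of what such print statements could look like: the helper below writes one line per checkpoint to stderr and flushes it right away, so the last line is still visible after the process has been killed. The names trace, format_trace, and peak_rss_kb are made up for this example and are not part of Open MPI. The peak-memory figure is included only on the assumption that memory growth across loop iterations might be involved; the error message itself does not confirm that.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of this process so far.
 * On Linux ru_maxrss is reported in kilobytes (other systems may use
 * different units). Returns -1 if getrusage() fails. */
long peak_rss_kb(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Build one trace line of the form
 *   [rank R] iter I: LABEL (peak RSS N kB)
 * Returns the value of snprintf(): the length the full line needs. */
int format_trace(char *buf, size_t len, int rank, long iter,
                 const char *label, long rss_kb)
{
    assert(buf != NULL && label != NULL);
    return snprintf(buf, len, "[rank %d] iter %ld: %s (peak RSS %ld kB)",
                    rank, iter, label, rss_kb);
}

/* Print a trace line to stderr and flush immediately, so the output is
 * not lost in a buffer if the process is killed right afterwards. */
void trace(int rank, long iter, const char *label)
{
    char line[256];
    format_trace(line, sizeof line, rank, iter, label, peak_rss_kb());
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
}
```

In the MPI program, the rank would come from MPI_Comm_rank(MPI_COMM_WORLD, &rank), and calls such as trace(rank, i, "start of iteration") and trace(rank, i, "end of iteration") could be placed inside the FOR loop. The last line printed by rank 0 then shows roughly how far it got before the SIGKILL.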

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

Hi all,

I am running an Open MPI (1.3.4) program with 200 parallel processes.

But the program is terminated with:

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--------------------------------------------------------------------------

After searching, I found that signal 9 (SIGKILL) means:

the process is currently in an unworkable state and should be terminated with extreme prejudice

 If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.

 The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
 
But the error message does not indicate any possible reason for the termination.

There is a FOR loop in the main() program. If the number of iterations is small (< 200), the program works well,
but as it becomes larger and larger, the program gets SIGKILL.

The cluster where I am running the MPI program does not allow running debug tools. 

If I run it on a workstation, it takes a very long time (for > 200 loops) to
reproduce the error.

What can I do to find the possible bugs?

Any help is really appreciated. 

Thanks,

Jack

_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users