Signal 9 most likely means that some external entity killed your MPI job (e.g., a resource manager determined that your process exceeded a limit on time, CPU, memory, or some other resource, and killed it). That is also consistent with what you describe: short jobs complete with no problem, but (presumably) longer jobs get killed as you described below -- with signal 9.
You might want to check with your system administrator and see if there are any resource limits on user-run applications.
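While you wait to hear back from your administrator, a quick sketch of things you can check yourself on the compute nodes (exact log locations and whether you have permission to read them will vary by distribution and cluster setup):

```shell
# Show the per-process resource limits in effect for your shell.
# Note: limits on compute nodes may differ from the login node, so
# run this inside a job (e.g., via mpirun) to see the real values.
ulimit -a

# Signal 9 can also come from the kernel's OOM killer when a node
# runs out of memory; grep the kernel log for evidence.
# (dmesg may require elevated privileges on some systems.)
dmesg | grep -i -E 'out of memory|killed process' || true
```

If `ulimit -a` shows a finite CPU-time or memory limit, that would fit the pattern of long jobs dying while short ones succeed.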
On Jul 22, 2010, at 8:18 PM, Jack Bryan wrote:
> Dear All:
> I run a parallel job on 6 nodes of an OpenMPI cluster.
> But I got error:
> rank 0 in job 82 system.cluster_37948 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
> It seems that there is a segmentation fault on node 0.
> But if the program runs for only a short time, there is no problem.
> Any help is appreciated.
> July 22 2010