Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI killed by signal 9
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-07-22 20:29:38


Signal 9 (SIGKILL) more than likely means that some external entity killed your MPI job (e.g., a resource manager determined that your process used too much time, CPU, or some other resource and killed it). That also fits what you describe below: short jobs complete with no problem, but (presumably) longer jobs get killed with signal 9.
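
To make the signal number concrete, here is a minimal sketch, assuming a POSIX system (such as Linux) with Python 3 available. The helper names `describe_signal` and `can_be_handled` are made up for illustration. It maps a signal number to its name and notes whether a process could catch it at all:

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


def can_be_handled(signum: int) -> bool:
    """SIGKILL and SIGSTOP cannot be caught, blocked, or ignored."""
    return signum not in (signal.SIGKILL, signal.SIGSTOP)


# Compare the signal from the error message (9) with the one a
# segmentation fault would normally produce on Linux (11).
for num in (9, 11):
    print(num, describe_signal(num), "catchable:", can_be_handled(num))
```

On Linux, 9 is SIGKILL and 11 is SIGSEGV, so a plain segmentation fault would normally show up as signal 11 rather than 9. Since SIGKILL cannot be caught, the application itself gets no chance to log anything when it arrives.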

You might want to check with your system administrator and see if there are any resource limits on user-run applications.
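
As a first check before contacting the administrator, a minimal sketch along these lines prints the per-process limits that the job inherits. It assumes a POSIX system with Python 3; the `LIMITS` table and the function names are illustrative choices, not anything provided by Open MPI:

```python
import resource

# A few limits that commonly affect long-running jobs (illustrative selection).
LIMITS = {
    "CPU time (s)": resource.RLIMIT_CPU,
    "address space (bytes)": resource.RLIMIT_AS,
    "data segment (bytes)": resource.RLIMIT_DATA,
    "file size (bytes)": resource.RLIMIT_FSIZE,
}


def fmt(value: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> dict:
    """Return {label: (soft, hard)} for the limits listed above."""
    out = {}
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        out[label] = (fmt(soft), fmt(hard))
    return out


for label, (soft, hard) in current_limits().items():
    print(f"{label:24s} soft={soft:>12s} hard={hard:>12s}")
```

Run it on a compute node, in the same environment the MPI processes start in, since limits can differ from the login node. Note that this only covers per-process limits: a wall-clock limit enforced by a batch scheduler or a kill by the kernel's out-of-memory handler would not appear here, so the administrator is still the authoritative source.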

On Jul 22, 2010, at 8:18 PM, Jack Bryan wrote:

> Dear All:
>
> I run a parallel job on 6 nodes of an OpenMPI cluster.
>
> But I got an error:
>
> rank 0 in job 82 system.cluster_37948 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> It seems that there is a segmentation fault on node 0.
>
> But if the program is run for a short time, there is no problem.
>
> Any help is appreciated.
>
> thanks,
>
> Jack
>
> July 22 2010
>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/