Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] OMPI error terminate w/o reasons
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-03-26 09:53:40


Try adding some print statements so you can see where the error occurs.
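As a minimal sketch of that suggestion (not code from the original program): the helper below writes a rank-tagged line to stderr and flushes it at once, so the last line printed marks the last point reached before the kill. The names `trace`, `format_trace`, and `peak_rss` are hypothetical. The peak-memory field is an assumption added here, on the guess that a failure that only appears at higher loop counts might be tied to growing memory use; nothing in this thread confirms that.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of this process so far, or -1 on failure.
   On Linux ru_maxrss is in kilobytes; other systems may use other units. */
static long peak_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Format one trace line into buf; returns what snprintf returns. */
static int format_trace(char *buf, size_t len, int rank, long iter,
                        const char *stage, long rss)
{
    return snprintf(buf, len, "[rank %d] iter %ld: %s (peak RSS %ld)",
                    rank, iter, stage, rss);
}

/* Print a trace line to stderr and flush immediately. stderr is normally
   unbuffered already; the explicit flush is only a safeguard. */
static void trace(int rank, long iter, const char *stage)
{
    char line[256];
    format_trace(line, sizeof line, rank, iter, stage, peak_rss());
    fprintf(stderr, "%s\n", line);
    fflush(stderr);
}

/* Hypothetical use inside the main loop, with rank obtained from
   MPI_Comm_rank(MPI_COMM_WORLD, &rank):

       for (i = 0; i < n; i++) {
           trace(rank, i, "begin iteration");
           ... existing work ...
           trace(rank, i, "end iteration");
       }
*/
```

With 200 processes the output can get large, so it may help to call `trace` only on rank 0 (the rank named in the error message) or only every few iterations.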

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

> Hi all,
>
> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>
> But the program is terminated with:
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
> --------------------------------------------------------------------------
>
> After searching, I found that signal 9 means:
>
> the process is currently in an unworkable state and should be terminated with extreme prejudice
>
> If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.
>
> The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
>
> But the error message does not indicate any possible reason for the termination.
>
> There is a for loop in main(). If the iteration count is small (< 200), the program works well,
> but as it grows larger, the program gets SIGKILL.
>
> The cluster where I am running the MPI program does not allow running debug tools.
>
> If I run it on a workstation, it takes a very long time (for > 200 iterations) to
> reproduce the error.
>
> What can I do to find the possible bugs?
>
> Any help is really appreciated.
>
> Thanks,
>
> Jack
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users