Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI error terminate w/o reasons
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-03-27 15:08:31


It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit.

Talk to your sys admin about imposed limits. Usually, there are flags you can provide to your job submission that allow you to change limits for your program.
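
As a rough illustration of that advice (not something taken from this cluster), a Torque job script can request limits through `#PBS -l` lines and record the limits the job actually runs under. The resource values below are placeholders, `report_limits` is a hypothetical helper name, and which resources a site honors is an assumption to confirm with the sys admin.

```shell
#!/bin/sh
# Hypothetical resource requests: the values are placeholders, and the
# set of resources a site enforces varies. Ask the sys admin.
#PBS -l walltime=12:00:00
#PBS -l pmem=2gb

# Print a few limits in effect for this shell, one name=value per line.
# "unlimited" means no limit is imposed at this level.
report_limits() {
    echo "cpu_time_s=$(ulimit -t)"
    echo "virtual_mem_kb=$(ulimit -v)"
    echo "open_files=$(ulimit -n)"
}

# Record the limits at the top of the job output for later comparison.
report_limits
```

Comparing this output between a run that finishes and a run that is killed can show whether one of these limits is the likely cause.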

On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:

> Hi, I have figured out how to run the command.
>
> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib -output-filename 700g200i200p14ye ./myapplication
>
> Each process prints to a distinct file.
>
> But the program is terminated with this error:
> ---------------------------------------------------------------------------------------------------------------------
> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> mpirun: Forwarding signal 10 to job
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> n341
> n338
> n337
> n336
> n335
> n334
> n333
> n332
> n331
> n329
> n328
> n326
> n324
> n321
> n318
> n316
> n315
> n314
> n313
> n312
> n309
> n308
> n306
> n305
>
> --------------------------------------------------------------------
>
> After searching, I find that the error is probably related to very frequent I/O activity.
>
> I have also run valgrind's memcheck to look for the possible cause of the original
> signal 9 (SIGKILL) problem.
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication
>
> But I got an error similar to the one above.
>
> What does the error mean?
> I cannot change the file system of the cluster.
>
> I only want a way to find the bug, which appears only when the problem size is very large.
>
> But I am now stuck on the SIGKILL and then the MOM_SISTER issue above.
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
> --------------------------------------------------------------------------------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 20:47:19 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> That command line cannot possibly work. Both the -rf and --output-filename options require arguments.
>
> PLEASE read the documentation? mpirun -h, or "man mpirun" will tell you how to correctly use these options.
>
>
> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
>
> Hi, I used:
>
> mpirun -np 200 -rf --output-filename /mypath/myapplication
> But no files are printed out.
>
> Can the "--debug" option help me here?
>
> When I tried :
>
> -bash-3.2$ mpirun -debug
> --------------------------------------------------------------------------
> A suitable debugger could not be found in your PATH. Check the values
> specified in the orte_base_user_debugger MCA parameter for the list of
> debuggers that was searched.
> --------------------------------------------------------------------------
> Any help is really appreciated.
>
> thanks
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 15:45:39 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it:
>
> mpirun --output-filename /myhome/debug/run01
>
> and in /myhome/debug, you will find files:
>
> run01.0
> run01.1
> ...
>
> each with the output from the indicated rank.
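
With one file per rank, a quick way to see where each of the 200 processes stopped is to print only the last line of every file. The sketch below assumes the `<prefix>.<rank>` naming described above; `last_line_per_rank` is a hypothetical helper, not part of Open MPI.

```shell
#!/bin/sh
# Print "<rank>: <last line>" for every per-rank output file that
# shares the given prefix (files are assumed to be named <prefix>.<rank>).
last_line_per_rank() {
    prefix=$1
    for f in "$prefix".*; do
        [ -f "$f" ] || continue    # skip when the glob matched nothing
        printf '%s: %s\n' "${f##*.}" "$(tail -n 1 "$f")"
    done
}

# Example call, using the path from the message above:
# last_line_per_rank /myhome/debug/run01 | sort -n
```

Output that was still buffered when a process received SIGKILL never reaches the file, so the last line shown is only a lower bound on how far that rank got.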
>
>
>
> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
>
> The cluster can print all output into one file.
>
> But checking it for bugs is very hard.
>
> The cluster also prints possible error messages into one file.
>
> But sometimes the error file is empty, and sometimes it shows signal 9.
>
> If I run only dummy tasks on the worker nodes, there are no errors.
>
> If I run the real task, sometimes processes are terminated without any errors before the program exits normally.
> Sometimes the program gets signal 9 but no other error messages.
>
> It is weird.
>
> Any help is really appreciated.
>
> Jack
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 15:18:53 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> I don't know, but Ashley may be able to help - or you can see his web site for instructions.
>
> Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options.
>
> -output-filename|--output-filename <arg0>
> Redirect output from application processes into
> filename.rank
>
>
> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
>
> Is it possible to have padb print the stack trace and other program execution information into a file?
>
> I can run the program in gdb like this:
>
> mpirun -np 200 -e gdb ./myapplication
>
> How can I make gdb print the debug information to a file,
> so that I can check it after the program is terminated?
>
> thanks
>
> Jack
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 13:56:13 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> You don't need to install anything on a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes.
>
> As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.
>
>
> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
>
> Hi,
>
> I am working on a cluster where I am not allowed to install software in system folders.
>
> My Open MPI is 1.3.4.
>
> I have had a very quick look at padb on http://padb.pittman.org.uk/ .
>
> Does it require installing some software on the cluster in order to use it?
>
> I cannot run jobs on the cluster from the command line, only through a script.
>
> thanks
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 12:12:11 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Have you tried a parallel debugger such as padb?
>
> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
>
> Hi,
>
> I have tried this. But the printout from 200 parallel processes makes it
> very hard to locate the possible bug.
>
> They may not stop at the same point when the program gets signal 9.
>
> So, even though I can sort out the print statements from all
> 200 processes, the many different locations where the processes
> stopped make it harder to find hints about the bug.
>
> Are there other programming tricks that can help me
> narrow down the suspect points quickly?
> Any help is appreciated.
>
> Jack
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 07:53:40 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Try adding some print statements so you can see where the error occurs.
>
> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
>
> Hi, all:
>
> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>
> But the program is terminated with
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
> --------------------------------------------------------------------------
>
> After searching, I found that signal 9 means:
>
> the process is currently in an unworkable state and should be terminated with extreme prejudice
>
> If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.
>
> The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
>
> But the error message does not indicate any possible reason for the termination.
>
> There is a FOR loop in the main() program. If the loop count is small (< 200), the program works well,
> but as it becomes larger and larger, the program gets SIGKILL.
>
> The cluster where I am running the MPI program does not allow running debug tools.
>
> If I run it on a workstation, it takes a very long time (for > 200 loops) to
> make the error occur again.
>
> What can I do to find the possible bugs?
>
> Any help is really appreciated.
>
> thanks
>
> Jack
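
One possibility that fits a failure appearing only at large loop counts, though nothing in this thread confirms it, is memory use that grows with each iteration until the system or the batch scheduler kills the process. A minimal sketch for checking that follows. It assumes a Linux node with `/proc` (falling back to `ps` otherwise), and `rss_kb` and `watch_rss` are hypothetical helper names.

```shell
#!/bin/sh
# Print the resident set size (kB) of a process. Reads Linux /proc when
# available, otherwise falls back to ps.
rss_kb() {
    pid=$1
    if [ -r "/proc/$pid/status" ]; then
        awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status"
    else
        ps -o rss= -p "$pid" | tr -d ' '
    fi
}

# Sample a process at a fixed interval (default 5 s) until it exits,
# printing one "<epoch seconds> pid=<pid> rss_kb=<value>" line per sample.
watch_rss() {
    pid=$1
    interval=${2:-5}
    while kill -0 "$pid" 2>/dev/null; do
        printf '%s pid=%s rss_kb=%s\n' "$(date +%s)" "$pid" "$(rss_kb "$pid")"
        sleep "$interval"
    done
}

# Example for a single local process (hypothetical usage):
# ./myapplication &
# watch_rss $! 10 >> rss.log
```

A steadily rising `rss_kb` column up to the moment of the kill would point toward a leak or an allocation that scales with the loop count. A flat one would point elsewhere.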
>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users