
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OMPI error terminate w/o reasons
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-03-27 20:29:11


+1 on what Ralph is saying.

You need to talk to your local administrators and ask them why Torque is killing your job. Perhaps you're submitting to a queue that only allows jobs to run for a few seconds, or something like that.
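
If you want to check the likely culprits yourself first, Torque/PBS can usually report a queue's enforced limits. A minimal sketch, assuming a Torque-style qstat; "batch" and <jobid> are placeholders:

  # full attribute list for a queue, including resources_max.walltime and such
  qstat -Qf batch

  # the resource limits actually applied to one of your jobs
  qstat -f <jobid>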

On Mar 27, 2011, at 3:08 PM, Ralph Castain wrote:

> It means that Torque is unhappy with your job - either you are running longer than it permits, or you exceeded some other system limit.
>
> Talk to your sys admin about imposed limits. Usually, there are flags you can provide to your job submission that allow you to change limits for your program.
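>
> For instance, a minimal sketch of submission directives that raise the usual limits (all values are placeholders; your site's maxima may differ):
>
>   # ask for more wall-clock time
>   #PBS -l walltime=24:00:00
>   # ask for memory explicitly
>   #PBS -l mem=16gb
>
> or equivalently on the command line: qsub -l walltime=24:00:00 myscript.pbs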
>
>
> On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:
>
>> Hi, I have figured out how to run the command.
>>
>> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
>>
>> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib -output-filename 700g200i200p14ye ./myapplication
>>
>> Each process prints its output to a distinct file.
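>>
>> For reference, the rankfile itself is just a plain-text mapping of ranks to nodes and slots; a minimal sketch (the host names and slots are only illustrative):
>>
>>   rank 0=n305 slot=0
>>   rank 1=n305 slot=1
>>   rank 2=n306 slot=0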
>>
>> But the program is terminated by this error:
>> ---------------------------------------------------------------------------------------------------------------------
>> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code 1099) - received SISTER_EOF attempting to communicate with sister MOM's
>> mpirun: Forwarding signal 10 to job
>> mpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> n341
>> n338
>> n337
>> n336
>> n335
>> n334
>> n333
>> n332
>> n331
>> n329
>> n328
>> n326
>> n324
>> n321
>> n318
>> n316
>> n315
>> n314
>> n313
>> n312
>> n309
>> n308
>> n306
>> n305
>>
>> --------------------------------------------------------------------
>>
>> After searching, I find that the error is probably related to highly frequent I/O activity.
>>
>> I have also run valgrind's memcheck to look for the possible cause of the original
>> signal 9 (SIGKILL) problem.
>>
>> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication
>>
>> But I got the same error as above.
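>>
>> (A side note on that valgrind run: with a single --log-file name, all 200 ranks write over the same file. Valgrind can expand an environment variable into the log name, so per-rank logs should be possible, assuming Open MPI 1.3 exports OMPI_COMM_WORLD_RANK into each process's environment:
>>
>>   mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib \
>>     /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes \
>>     --log-file=valg.%q{OMPI_COMM_WORLD_RANK}.log ./myapplication )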
>>
>> What does the error mean?
>> I cannot change the file system of the cluster.
>>
>> I only want to find a way to locate the bug, which appears only when the problem size is very large.
>>
>> But now I am stuck on the SIGKILL and the SISTER_EOF issue above.
>>
>> Any help is really appreciated.
>>
>> thanks
>>
>> Jack
>>
>> --------------------------------------------------------------------------------------------------------
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 20:47:19 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> That command line cannot possibly work. Both the -rf and --output-filename options require arguments.
>>
>> PLEASE read the documentation: "mpirun -h" or "man mpirun" will tell you how to use these options correctly.
>>
>>
>> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
>>
>> Hi, I used :
>>
>> mpirun -np 200 -rf --output-filename /mypath/myapplication
>> But, no files are printed out.
>>
>> Can the "--debug" option help me here?
>>
>> When I tried :
>>
>> -bash-3.2$ mpirun -debug
>> --------------------------------------------------------------------------
>> A suitable debugger could not be found in your PATH. Check the values
>> specified in the orte_base_user_debugger MCA parameter for the list of
>> debuggers that was searched.
>> --------------------------------------------------------------------------
>> Any help is really appreciated.
>>
>> thanks
>>
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 15:45:39 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> If you use that mpirun option, mpirun will place the output from each rank into a -separate- file for you. Give it:
>>
>> mpirun --output-filename /myhome/debug/run01
>>
>> and in /myhome/debug, you will find files:
>>
>> run01.0
>> run01.1
>> ...
>>
>> each with the output from the indicated rank.
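>>
>> A quick way to see how far each rank got before dying is then, for example:
>>
>>   tail -n 2 /myhome/debug/run01.*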
>>
>>
>>
>> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
>>
>> The cluster can print all output into one file.
>>
>> But checking it for bugs is very hard.
>>
>> The cluster also prints possible error messages into one file.
>>
>> But sometimes the error file is empty, and sometimes it reports signal 9.
>>
>> If I only run dummy tasks on worker nodes, there are no errors.
>>
>> If I run the real task, sometimes processes are terminated without any error before the program exits normally.
>> Sometimes the program gets signal 9 but no other error messages.
>>
>> It is weird.
>>
>> Any help is really appreciated.
>>
>> Jack
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 15:18:53 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> I don't know, but Ashley may be able to help - or you can see his web site for instructions.
>>
>> Alternatively, since you can put print statements into your code, have you considered using mpirun's option to direct output from each rank into its own file? Look at "mpirun -h" for the options.
>>
>> -output-filename|--output-filename <arg0>
>> Redirect output from application processes into
>> filename.rank
>>
>>
>> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
>>
>> Is it possible to have padb print the stack trace and other program execution information into a file?
>>
>> I can run the program in gdb like this:
>>
>> mpirun -np 200 -e gdb ./myapplication
>>
>> How can I make gdb print the debug information to a file,
>> so that I can check it when the program is terminated?
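>>
>> One possible approach, sketched under two assumptions (mpirun exports OMPI_COMM_WORLD_RANK, which Open MPI 1.3 does, and gdb exists on the compute nodes):
>>
>>   mpirun -np 200 bash -c \
>>     'exec gdb -batch -ex run -ex "bt full" --args ./myapplication \
>>      > gdb.rank$OMPI_COMM_WORLD_RANK.log 2>&1'
>>
>> Each rank then runs under its own batch-mode gdb, and its backtrace lands in gdb.rank<N>.log. Note that SIGKILL cannot be caught, so this only shows where ranks die from other signals.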
>>
>> thanks
>>
>> Jack
>>
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 13:56:13 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> You don't need to install anything on a system folder - you can just install it in your home directory, assuming that is accessible on the remote nodes.
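>>
>> The usual user-local recipe is along these lines (a sketch; whether padb's tarball follows exactly this configure/make flow is an assumption to check against its README):
>>
>>   ./configure --prefix=$HOME/local
>>   make && make install
>>   export PATH=$HOME/local/bin:$PATH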
>>
>> As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.
>>
>>
>> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
>>
>> Hi,
>>
>> I am working on a cluster where I am not allowed to install software in system folders.
>>
>> My Open MPI is 1.3.4.
>>
>> I had a very quick look at padb on http://padb.pittman.org.uk/.
>>
>> Does it require installing some software on the cluster in order to use it?
>>
>> I cannot run jobs on the cluster from the command line, only through a script.
>>
>> thanks
>>
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 12:12:11 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> Have you tried a parallel debugger such as padb?
>>
>> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
>>
>> Hi,
>>
>> I have tried this, but the printout from 200 parallel processes makes it
>> very hard to locate the possible bug.
>>
>> They may not stop at the same point when the program gets signal 9.
>>
>> So even if I can sort out the print statements from all
>> 200 processes, the many different places where the processes
>> stop make it hard to find any hints about the bug.
>>
>> Are there other programming tricks that can help me
>> narrow down the suspect points quickly?
>> Any help is appreciated.
>>
>> Jack
>>
>> From: rhc_at_[hidden]
>> Date: Sat, 26 Mar 2011 07:53:40 -0600
>> To: users_at_[hidden]
>> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>>
>> Try adding some print statements so you can see where the error occurs.
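>>
>> For example, a minimal checkpoint macro (the name is purely illustrative) that tags each message with the rank and flushes immediately, so the last line a rank prints survives even if the process is killed right afterwards:
>>
>>   #include <mpi.h>
>>   #include <stdio.h>
>>
>>   /* print rank, file, and line, then flush so nothing is lost in a buffer */
>>   #define CHECKPOINT() do { \
>>       int r_; MPI_Comm_rank(MPI_COMM_WORLD, &r_); \
>>       fprintf(stderr, "rank %d reached %s:%d\n", r_, __FILE__, __LINE__); \
>>       fflush(stderr); \
>>   } while (0)
>>
>> Sprinkle CHECKPOINT(); through the suspect loop; the last line from each rank then brackets where it died.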
>>
>> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
>>
>> Hi , All:
>>
>> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>>
>> But, the program is terminated with
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
>> --------------------------------------------------------------------------
>>
>> After searching, I found that signal 9 means:
>>
>> the process is currently in an unworkable state and should be terminated with extreme prejudice
>>
>> If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.
>>
>> The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
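>>
>> (For what it's worth: on clusters, a system-generated SIGKILL very often comes from the kernel's out-of-memory killer or an enforced resource limit. Two quick checks that can run from inside a job script, assuming the kernel log is readable:
>>
>>   ulimit -a          # per-process limits in effect for the job
>>   dmesg | tail -50   # OOM kills are usually logged here )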
>>
>> But the error message does not indicate any possible reason for the termination.
>>
>> There is a for loop in main(); if the loop count is small (< 200), the program works well,
>> but as it gets larger and larger, the program gets SIGKILL.
>>
>> The cluster where I am running the MPI program does not allow running debug tools.
>>
>> If I run it on a workstation, it takes a very long time (for > 200 loops) to
>> get the error to occur again.
>>
>> What can I do to find the possible bugs?
>>
>> Any help is really appreciated.
>>
>> thanks
>>
>> Jack
>>

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/