Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI error terminate w/o reasons
From: David Zhang (solarbikedz_at_[hidden])
Date: 2011-03-27 18:32:51


This might not have anything to do with your problem, but how do you
finalize your worker nodes when your master loop terminates?
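
To make the question concrete, here is a minimal sketch of the shutdown pattern I have in mind. It uses plain Python threads and queues rather than MPI, purely to show the shape of the protocol; the names (STOP, worker, run_master_worker) and the squaring "task" are hypothetical stand-ins, not anything from your code. In an MPI program the STOP sentinel would typically be a message with a dedicated tag, and each worker would go on to call MPI_Finalize after receiving it.

```python
import queue
import threading

# Unique sentinel playing the role of a dedicated "terminate" message.
STOP = object()


def worker(tasks, results, exited):
    """Process tasks until the STOP sentinel arrives, then exit cleanly."""
    while True:
        task = tasks.get()
        if task is STOP:
            exited.put(True)  # stand-in for the worker reaching its finalize step
            return
        results.put(task * task)  # placeholder for the real work


def run_master_worker(num_workers, work_items):
    """Distribute work_items to workers, collect results, then shut workers down."""
    tasks, results, exited = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, exited))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    # Master loop: hand out every task, then collect one result per task.
    for item in work_items:
        tasks.put(item)
    collected = [results.get() for _ in work_items]

    # After the loop: one STOP per worker, so none is left blocked on a receive.
    for _ in threads:
        tasks.put(STOP)
    for t in threads:
        t.join()
    return collected, exited.qsize()
```

The point of the sketch is the last step: if the master simply leaves its loop without telling the workers, they stay blocked waiting for a task that never comes.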

On Sun, Mar 27, 2011 at 3:27 PM, Jack Bryan <dtustudy68_at_[hidden]> wrote:

> Hi, my original error is:
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
>
> The main framework of my code is:
>
> main()
> {
>     for masternode:
>         while (loop <= LOOP_NUMBER)
>         {
>             master node distributes tasks to workers;
>             master collects results from workers;
>             ++loop;
>         }
>     for worker nodes:
>     {
>         get the task;
>         run the task; // call CPLEX API lib
>         return results to master;
>     }
> }
>
> When LOOP_NUMBER <= 600 (with 200 parallel processes), it works well.
> But when LOOP_NUMBER >= 700 (with 200 parallel processes), I get the
> error above.
>
> Could a limit imposed by Torque be the reason for this error?
>
> It seems that Torque complains about the heavy I/O caused by each
> process printing output.
>
> If I comment out the print statements in my code, the Torque complaints
> go away, but the signal 9 error is still there.
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
>
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sun, 27 Mar 2011 13:08:31 -0600
>
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> It means that Torque is unhappy with your job - either you are running
> longer than it permits, or you exceeded some other system limit.
>
> Talk to your sys admin about imposed limits. Usually, there are flags you
> can provide to your job submission that allow you to change limits for your
> program.
>
>
> On Mar 27, 2011, at 12:59 PM, Jack Bryan wrote:
>
> Hi, I have figured out how to run the command.
>
> OMPI_RANKFILE=$HOME/$PBS_JOBID.ranks
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib
> -output-filename 700g200i200p14ye ./myapplication
>
> Each process prints its output to a distinct file.
>
> But the program is terminated with this error:
>
> ---------------------------------------------------------------------------------------------------------------------
> =>> PBS: job killed: node 18 (n314) requested job terminate, 'EOF' (code
> 1099) - received SISTER_EOF attempting to communicate with sister MOM's
> mpirun: Forwarding signal 10 to job
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> n341
> n338
> n337
> n336
> n335
> n334
> n333
> n332
> n331
> n329
> n328
> n326
> n324
> n321
> n318
> n316
> n315
> n314
> n313
> n312
> n309
> n308
> n306
> n305
>
> --------------------------------------------------------------------
>
> After searching, I found that the error is probably related to very
> frequent I/O activity.
>
> I have also run valgrind's memcheck to look for the possible cause of the
> original signal 9 (SIGKILL) problem.
>
> mpirun -np 200 -rf $OMPI_RANKFILE --mca btl self,sm,openib
> /usr/bin/valgrind --tool=memcheck --error-limit=no --leak-check=yes
> --log-file=nsga2b_g700_pop200_p200_valg_cystorm_mpi.log ./myapplication
>
> But I got an error similar to the one above.
>
> What does the error mean?
> I cannot change the file system of the cluster.
>
> I only want a way to find the bug, which appears only when the problem
> size is very large.
>
> But I am now stuck on the SIGKILL and the sister MOM (SISTER_EOF) issue above.
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
>
> --------------------------------------------------------------------------------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 20:47:19 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> That command line cannot possibly work. Both the -rf and --output-filename
> options require arguments.
>
> PLEASE read the documentation? mpirun -h, or "man mpirun" will tell you how
> to correctly use these options.
>
>
> On Mar 26, 2011, at 6:35 PM, Jack Bryan wrote:
>
> Hi, I used:
>
> mpirun -np 200 -rf --output-filename /mypath/myapplication
> But no files are written.
>
> Can the "--debug" option help me here?
>
> When I tried:
>
> -bash-3.2$ mpirun -debug
> --------------------------------------------------------------------------
> A suitable debugger could not be found in your PATH. Check the values
> specified in the orte_base_user_debugger MCA parameter for the list of
> debuggers that was searched.
> --------------------------------------------------------------------------
> Any help is really appreciated.
>
> thanks
>
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 15:45:39 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> If you use that mpirun option, mpirun will place the output from each rank
> into a -separate- file for you. Give it:
>
> mpirun --output-filename /myhome/debug/run01
>
> and in /myhome/debug, you will find files:
>
> run01.0
> run01.1
> ...
>
> each with the output from the indicated rank.
>
>
>
> On Mar 26, 2011, at 3:41 PM, Jack Bryan wrote:
>
> The cluster can print all output into one file.
>
> But checking it for bugs is very hard.
>
> The cluster also prints possible error messages into one file.
>
> But sometimes the error file is empty, and sometimes it contains signal 9.
>
> If I run only dummy tasks on the worker nodes, there are no errors.
>
> If I run real tasks, sometimes processes are terminated without any error
> before the program exits normally.
> Sometimes the program gets signal 9 but no other error messages.
>
> It is weird.
>
> Any help is really appreciated.
>
> Jack
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 15:18:53 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> I don't know, but Ashley may be able to help - or you can see his web site
> for instructions.
>
> Alternatively, since you can put print statements into your code, have you
> considered using mpirun's option to direct output from each rank into its
> own file? Look at "mpirun -h" for the options.
>
> -output-filename|--output-filename <arg0>
> Redirect output from application processes into
> filename.rank
>
>
> On Mar 26, 2011, at 2:48 PM, Jack Bryan wrote:
>
> Is it possible to have padb print the stack trace and other
> program execution information into a file?
>
> I can run the program in gdb like this:
>
> mpirun -np 200 -e gdb ./myapplication
>
> How can I make gdb print the debug information to a file,
> so that I can check it after the program is terminated?
>
> thanks
>
> Jack
>
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 13:56:13 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> You don't need to install anything on a system folder - you can just
> install it in your home directory, assuming that is accessible on the remote
> nodes.
>
> As for the script - unless you can somehow modify it to allow you to run
> under a debugger, I am afraid you are completely out of luck.
>
>
> On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:
>
> Hi,
>
> I am working on a cluster where I am not allowed to install software in
> system folders.
>
> My Open MPI is 1.3.4.
>
> I have had a very quick look at padb at http://padb.pittman.org.uk/ .
>
> Does it require some software to be installed on the cluster in order to use it?
>
> I cannot run jobs on the cluster from the command line, only through a script.
>
> thanks
>
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 12:12:11 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Have you tried a parallel debugger such as padb?
>
> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
>
> Hi,
>
> I have tried this. But the printout from 200 parallel processes makes it
> very hard to locate the possible bug.
>
> They may not stop at the same point when the program gets signal 9.
>
> So even if I can sort out the print statements from all
> 200 processes, the many different locations where the processes
> stopped make it hard to find hints about the bug.
>
> Are there other programming tricks that can help me
> narrow down the suspect points quickly?
> Any help is appreciated.
>
> Jack
>
> ------------------------------
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 07:53:40 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Try adding some print statements so you can see where the error occurs.
>
> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
>
> Hi, all:
>
> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>
> But the program is terminated with
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on
> signal 9 (Killed).
> --------------------------------------------------------------------------
>
> After searching, I found that signal 9 means:
>
> the process is currently in an unworkable state and should be terminated
> with extreme prejudice
>
> If a process does not respond to any other termination signals, sending
> it a SIGKILL signal will almost always cause it to go away.
>
> The system will generate SIGKILL for a process itself under some unusual
> conditions where the program cannot possibly continue to run (even to run a
> signal handler).
>
> But, the error message does not indicate any possible reasons for the
> termination.
>
> There is a FOR loop in the main() program. If the loop count is small (<
> 200), the program works well,
> but as it becomes larger and larger, the program gets SIGKILL.
>
> The cluster where I am running the MPI program does not allow running debug
> tools.
>
> If I run it on a workstation, it takes a very long time (for > 200
> loops) to
> make the error occur again.
>
> What can I do to find the possible bugs ?
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
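
On the signal 9 itself: since it only shows up once the loop count grows, and Ralph pointed at system limits, one cheap check is whether memory use climbs from iteration to iteration. This is only a sketch under an assumption the thread has not confirmed, namely that a memory limit is what gets exceeded. It is written in Python for brevity; the helper names (peak_rss, run_loop) are hypothetical, and a C or C++ program would get the same number from getrusage().

```python
import resource


def peak_rss():
    """Peak resident set size of this process so far.

    Units are platform-dependent: kilobytes on Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_loop(iterations, step, log=print):
    """Run `step` once per iteration and report peak memory after each one."""
    history = []
    for i in range(iterations):
        step(i)  # placeholder for one distribute/collect round
        peak = peak_rss()
        history.append(peak)
        log(f"iteration {i}: peak RSS {peak}")
    return history
```

If the reported peak keeps rising with the loop count, a leak or unbounded buffer is a more likely suspect than the scheduler. Given the earlier I/O complaints, it may be wiser to log from a single rank, or only every few iterations.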

-- 
David Zhang
University of California, San Diego