Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OMPI error terminate w/o reasons
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-03-26 15:56:13


You don't need to install anything in a system folder. You can install it in your home directory, assuming your home directory is accessible on the remote nodes.

As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.
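
If a debugger stays out of reach, the print-statement suggestion from earlier in this thread can be made easier to read across 200 processes by giving each rank its own log file and flushing after every line. The following is only a minimal sketch in C. `rank_log_open` and `rank_log` are hypothetical helper names, not part of Open MPI, and the sketch assumes the rank is obtained elsewhere (for example from `MPI_Comm_rank`) and that the log directory is writable from the compute nodes.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open one log file per rank, e.g. "<dir>/rank-0042.log".
 * In an MPI program, `rank` would come from MPI_Comm_rank(). */
FILE *rank_log_open(const char *dir, int rank)
{
    char path[512];
    int n = snprintf(path, sizeof path, "%s/rank-%04d.log", dir, rank);
    if (n < 0 || (size_t)n >= sizeof path)
        return NULL;                /* path would not fit in the buffer */
    return fopen(path, "w");        /* NULL if the file cannot be created */
}

/* Hypothetical helper: write one checkpoint line and flush it right away,
 * so the line has left the stdio buffer before the process can be killed. */
void rank_log(FILE *log, int rank, long iter, const char *where)
{
    assert(log != NULL);
    fprintf(log, "rank=%d iter=%ld at=%s\n", rank, iter, where);
    fflush(log);
}
```

Calling `rank_log` at a few fixed points in each loop iteration means the last line of every file records the last checkpoint that rank reached. After the job dies, something like `tail -n 1 rank-*.log` shows those lines side by side, which avoids untangling interleaved output from all processes.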

On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:

> Hi,
>
> I am working on a cluster where I am not allowed to install software in system folders.
>
> My Open MPI is 1.3.4.
>
> I had a very quick look at padb at http://padb.pittman.org.uk/ .
>
> Does it require some software to be installed on the cluster in order to use it?
>
> I cannot run jobs on the cluster from the command line, only through a script.
>
> thanks
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 12:12:11 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Have you tried a parallel debugger such as padb?
>
> On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:
>
> Hi,
>
> I have tried this. But the printout from 200 parallel processes makes it
> very hard to locate the possible bug.
>
> The processes may not all stop at the same point when the program gets signal 9.
>
> So, even if I can sort out the print statements from all
> 200 processes, the many different locations where the processes
> stopped make it hard to find hints about the bug.
>
> Are there other programming tricks that can help me
> narrow down the suspect points quickly?
> Any help is appreciated.
>
> Jack
>
> From: rhc_at_[hidden]
> Date: Sat, 26 Mar 2011 07:53:40 -0600
> To: users_at_[hidden]
> Subject: Re: [OMPI users] OMPI error terminate w/o reasons
>
> Try adding some print statements so you can see where the error occurs.
>
> On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:
>
> Hi, All:
>
> I am running an Open MPI (1.3.4) program with 200 parallel processes.
>
> But the program is terminated with
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
> --------------------------------------------------------------------------
>
> After searching, I found that signal 9 means:
>
> the process is currently in an unworkable state and should be terminated with extreme prejudice
>
> If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.
>
> The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
>
> But the error message does not indicate any possible reason for the termination.
>
> There is a FOR loop in the main() program. If the loop count is small (< 200), the program works well,
> but as it becomes larger and larger, the program gets SIGKILL.
>
> The cluster where I am running the MPI program does not allow running debug tools.
>
> If I run it on a workstation, it takes a very long time (for > 200 loops)
> for the error to occur again.
>
> What can I do to find the possible bugs?
>
> Any help is really appreciated.
>
> thanks
>
> Jack
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users