You don't need to install anything in a system folder; you can just install it in your home directory, assuming that directory is accessible on the remote nodes.

As for the script - unless you can somehow modify it to allow you to run under a debugger, I am afraid you are completely out of luck.


On Mar 26, 2011, at 12:54 PM, Jack Bryan wrote:

Hi, 

I am working on a cluster where I am not allowed to install software in system folders.

My Open MPI version is 1.3.4.

I had a very quick look at padb at http://padb.pittman.org.uk/

Does it require installing any software on the cluster in order to use it?

I cannot run jobs on the cluster from the command line, only through a script.

thanks


From: rhc@open-mpi.org
Date: Sat, 26 Mar 2011 12:12:11 -0600
To: users@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Have you tried a parallel debugger such as padb?

On Mar 26, 2011, at 10:34 AM, Jack Bryan wrote:

Hi, 

I have tried this, but the printout from 200 parallel processes makes it
very hard to locate the possible bug.

The processes may not all stop at the same point when the program gets signal 9.

So, even if I can sort out the print statements from all
200 processes, the many different locations where the processes
stopped make it harder to find hints about the bug.

Are there other programming tricks that can help me
narrow down the suspect points quickly?
Any help is appreciated.

Jack


From: rhc@open-mpi.org
Date: Sat, 26 Mar 2011 07:53:40 -0600
To: users@open-mpi.org
Subject: Re: [OMPI users] OMPI error terminate w/o reasons

Try adding some print statements so you can see where the error occurs.
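
With many processes, interleaved output on a shared stdout is hard to read, so one option is to have each rank append checkpoints to its own file. The following is a minimal sketch under stated assumptions, not code from the program discussed in this thread: the helper names (rank_log_path, log_checkpoint, last_checkpoint), the file naming scheme, and the line format are all hypothetical, and the rank is passed in as a plain int (in an MPI program it would be the value obtained from MPI_Comm_rank()).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build "<dir>/rank-NNNN.log". Returns 0 on success. */
static int rank_log_path(char *buf, size_t len, const char *dir, int rank)
{
    int n = snprintf(buf, len, "%s/rank-%04d.log", dir, rank);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Append one checkpoint line for this rank. `where` is a short label
 * without spaces. In an MPI program, `rank` would come from MPI_Comm_rank(). */
static int log_checkpoint(const char *dir, int rank, long iter, const char *where)
{
    char path[1024];
    FILE *f;

    assert(rank >= 0 && where != NULL);
    if (rank_log_path(path, sizeof path, dir, rank) != 0)
        return -1;
    f = fopen(path, "a");
    if (f == NULL)
        return -1;
    fprintf(f, "iter=%ld at=%s\n", iter, where);
    /* fclose() flushes the stdio buffer to the OS before returning, so the
     * line should not be lost if the process is killed afterwards. */
    return fclose(f) == 0 ? 0 : -1;
}

/* Read back the last checkpoint a rank recorded. Returns 0 if one was found. */
static int last_checkpoint(const char *dir, int rank, long *iter, char where[64])
{
    char path[1024], line[256];
    FILE *f;
    int found = -1;

    if (rank_log_path(path, sizeof path, dir, rank) != 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        long it;
        char w[64];
        /* Keep overwriting, so the values from the final valid line win. */
        if (sscanf(line, "iter=%ld at=%63s", &it, w) == 2) {
            *iter = it;
            snprintf(where, 64, "%s", w);
            found = 0;
        }
    }
    fclose(f);
    return found;
}
```

Calling log_checkpoint() at a few fixed points inside the loop (for example at the start and end of each iteration) leaves one small file per rank; after the job dies, the last line of each file shows how far that rank got. Opening and closing the file on every call is slow, so this is only meant for a debugging run.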

On Mar 25, 2011, at 11:49 PM, Jack Bryan wrote:

Hi, all:

I am running an Open MPI (1.3.4) program with 200 parallel processes.

But the program is terminated with:

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 77967 on node n342 exited on signal 9 (Killed).
--------------------------------------------------------------------------

After searching, I found that signal 9 means:

the process is currently in an unworkable state and should be terminated with extreme prejudice

 If a process does not respond to any other termination signals, sending it a SIGKILL signal will almost always cause it to go away.

 The system will generate SIGKILL for a process itself under some unusual conditions where the program cannot possibly continue to run (even to run a signal handler).
 
But the error message does not indicate any possible reason for the termination.

There is a for loop in main(). If the number of iterations is small (< 200), the program works well,
but as it becomes larger and larger, the program gets SIGKILL.
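
A failure that only appears at larger loop counts may be worth checking against resource usage. This is an assumption, not something the thread establishes: nothing here confirms that memory is the cause. As a minimal sketch on a POSIX system, the peak resident set size can be printed once per iteration with getrusage(); the helper names peak_rss and report_memory are hypothetical, and the rank is passed in as a plain int.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process as reported by getrusage().
 * The unit is kilobytes on Linux (bytes on macOS). Returns -1 on failure. */
static long peak_rss(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Hypothetical helper: write one line per loop iteration and flush it
 * immediately, so the last value is visible even if the process is killed. */
static int report_memory(FILE *out, int rank, long iter)
{
    long rss;

    assert(out != NULL);
    rss = peak_rss();
    if (rss < 0)
        return -1;
    fprintf(out, "rank=%d iter=%ld maxrss=%ld\n", rank, iter, rss);
    return fflush(out) == 0 ? 0 : -1;
}
```

If the reported value climbs steadily with the iteration number, that would point toward a leak or an ever-growing buffer; if it stays flat, this explanation becomes less likely and the cause lies elsewhere.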

The cluster where I am running the MPI program does not allow running debug tools. 

If I run it on a workstation, it takes a very long time (for > 200 loops)
to reproduce the error.

What can I do to find the possible bugs?

Any help is really appreciated. 

thanks

Jack





_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

