Sorry for the delayed response - I've been trying to catch up after a longer-than-anticipated trip.
On Feb 10, 2010, at 3:39 AM, sanjay kumar jaiswal wrote:
> I am currently working on OMPI in order to handle fault
> tolerance in distributed environment for parallel computing where
> infrastructure is not fixed. after assigning the task to n number of
> nodes if link goes down or any process just gets failed then whole
> process either get hanged or rolled back. i want to handle this
> situation in followings ways-
> 1- to detect whether process gets failed or link goes down
Those are, of course, two quite different scenarios. We currently detect them in several ways:
1. Process failure is detected by a waitpid callback to the local daemon. In addition, the daemon notes the closure of the TCP socket to that process (used for routing communications) and of the pipes used to forward I/O.
2. Link failure is detected when mpirun (or the cluster manager for this project) sees either closure of the TCP socket to the daemon on that node or (if enabled) two consecutive missing heartbeats from the daemon.
Both of these detection methods are currently in OMPI, in both the developer's trunk and the official releases.
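
As a rough illustration of the two detection paths above, here is a minimal Python sketch. It is not ORTE's code: the names classify_exit, run_and_reap and HeartbeatMonitor are hypothetical, the waitpid call blocks instead of being driven by a callback, and the heartbeat check simply counts how many whole intervals have passed since the last beat.

```python
import os

def classify_exit(status):
    """Translate a raw waitpid status into (failed, reason)."""
    if os.WIFSIGNALED(status):
        return True, "killed by signal %d" % os.WTERMSIG(status)
    if os.WIFEXITED(status):
        code = os.WEXITSTATUS(status)
        return code != 0, "exit code %d" % code
    return True, "unknown status"

def run_and_reap(exit_code):
    """Fork a child that exits with exit_code, then reap it with waitpid.

    Assumption: a real daemon would reap asynchronously from a callback;
    this sketch blocks for simplicity.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)          # child: terminate immediately
    _, status = os.waitpid(pid, 0)   # parent: wait for the child to exit
    return classify_exit(status)

class HeartbeatMonitor:
    """Declare a link down after max_missed consecutive missing heartbeats."""

    def __init__(self, interval, max_missed=2):
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # two, to match the description above
        self.last_beat = {}           # node name -> time of last heartbeat

    def beat(self, node, now):
        self.last_beat[node] = now

    def missed(self, node, now):
        # Whole heartbeat intervals elapsed since the last beat from node.
        return int((now - self.last_beat[node]) // self.interval)

    def link_down(self, node, now):
        return self.missed(node, now) >= self.max_missed
```

Timestamps are passed in explicitly so the heartbeat logic can be exercised without sleeping; a real monitor would take them from a clock.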
> 2- after detecting fault, how this task can be reassigned to another
> available node
Migration of non-MPI processes is supported today by ORCM. If we see a process fail, we use ORTE's "resilient mapper" to come up with a (hopefully) intelligent place to put the restarted process. If the node fails (i.e., we lose the link to that node), we migrate all of its processes to new locations - they don't necessarily stay together (that is an option you can set). New daemons are started as required.
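
To make the node-failure case concrete, here is a small Python sketch of one possible remapping policy. It is an assumption for illustration only, not the resilient mapper's actual algorithm: remap_failed_node is a hypothetical helper that moves each displaced process to whichever surviving node currently hosts the fewest processes, so processes from the failed node may end up on different nodes.

```python
from collections import Counter

def remap_failed_node(placement, failed_node, nodes):
    """Return a new proc -> node mapping with failed_node's procs relocated.

    placement: dict mapping process name to node name
    nodes:     list of all known node names, including the failed one
    """
    survivors = [n for n in nodes if n != failed_node]
    if not survivors:
        raise RuntimeError("no surviving nodes to migrate to")

    # Count how many processes each surviving node already hosts.
    load = Counter({n: 0 for n in survivors})
    for node in placement.values():
        if node != failed_node:
            load[node] += 1

    new_placement = dict(placement)
    displaced = sorted(p for p, n in placement.items() if n == failed_node)
    for proc in displaced:
        # Least-loaded survivor first; break ties by node name.
        target = min(survivors, key=lambda n: (load[n], n))
        new_placement[proc] = target
        load[target] += 1
    return new_placement
```

For example, with p0 and p1 on node "a", p2 on "b" and p3 on "c", a failure of "a" sends p0 to "b" and p1 to "c" under this policy.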
Josh Hursey (Indiana U, and on this list) is working on extending the migration to include MPI processes, so I'll let him talk about that.
While this capability is in ORCM today, Josh and I are working (as we speak!) to migrate the work to OMPI as well. So hopefully soon, it will be possible for mpirun to perform many of the same functions. Not everything, of course, will be appropriate for MPI jobs!
> so right now to handle this type of error is not possible in
> OMPI as of my knowledge. I want to know about the ORCM project what
> are the problems that is going to be solved and could I know approx
> what time first version is going to be released.
As I said, some of it is available now, and the rest should be coming into the OMPI developer's trunk soon.
First release of ORCM should occur in the April-May time frame. This will include process monitoring and hardware sensors, as well as some fault prediction capability. See the presentation at:
for an overview of how ORCM works.
And feel free to ask questions. I'm behind on my promised documentation...sigh.
> sanjay jaiswal
> orcm-devel mailing list