Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2007-03-21 14:39:47


What you're looking for is called PVM. Moreover, your requirements
are a mixed bag of FT features that come from completely different
worlds.

1) Recover from any software/hardware crash? What kind of recovery
are you looking for? What is your definition of recovering? If what
you want is to be able to continue sending and receiving messages
once the fault has been detected, then FT-MPI is the only MPI
implementation that allows you to consistently continue your
execution. To be more precise, the MPI standard does not define the
behavior of the MPI library once you return from the error handler
that is called when a fault is detected. As far as I know, the
behavior depends on the MPI library, and with the exception of FT-MPI
no library is in a consistent state after returning from the error
handler.

2) Dynamically shrink and grow? Based on what? This looks like MPI-2
dynamic processes, except that you still have the original
MPI_COMM_WORLD, which cannot be shrunk. If what you want is to be
able to shrink your MPI_COMM_WORLD when a fault occurs, then again
the only solution is FT-MPI.

3) Migrate processes among machines? Which processes? When and how?
LAM allows you to checkpoint/restart the entire job, and the
checkpoint should be taken before the fault occurs. MPICH-V allows
transparent non-coordinated checkpointing (i.e. you don't get any
notification that a fault was detected), but you will pay the cost of
message logging. FT-MPI modifies the runtime environment when a fault
occurs, but does not do migration (if migration means moving the
application image with all its data to another machine).

Unfortunately, there is no miracle MPI that can do everything you're
looking for. Do you need multi-threading and fault tolerance? I would
use FT-MPI with a lock around all MPI functions, something close to
the serialized thread mode as defined by the MPI standard.
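
As a rough illustration of the "lock around all MPI functions" idea,
here is a minimal sketch in C using a single pthread mutex. It is not
taken from FT-MPI or any other library. To keep it compilable without
an MPI installation, the MPI call is replaced by a hypothetical
stand-in, fake_mpi_send(); in a real program that line would be an
actual call such as MPI_Send. The names mpi_lock, locked_send,
run_demo and max_concurrency are likewise made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One global lock guarding every call into the non-thread-safe library. */
static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bookkeeping, modified only while mpi_lock is held. */
static long calls_made = 0;  /* calls that went through the wrapper   */
static int  inside = 0;      /* threads currently inside the library  */
static int  max_inside = 0;  /* highest value 'inside' ever reached   */

/* Hypothetical stand-in for a real MPI call such as MPI_Send. */
static int fake_mpi_send(int value)
{
    inside++;
    assert(inside == 1);     /* trips if another thread is also inside */
    if (inside > max_inside)
        max_inside = inside;
    calls_made++;
    inside--;
    return value;
}

/* Wrapper: application threads call this, never the library directly. */
int locked_send(int value)
{
    pthread_mutex_lock(&mpi_lock);
    int rc = fake_mpi_send(value);
    pthread_mutex_unlock(&mpi_lock);
    return rc;
}

/* Each worker issues 'iters' wrapped calls. */
static void *worker(void *arg)
{
    int iters = *(const int *)arg;
    for (int i = 0; i < iters; i++)
        locked_send(i);
    return NULL;
}

/* Runs up to 16 threads; returns the number of wrapped calls, or -1
 * if nthreads is out of range. */
long run_demo(int nthreads, int iters)
{
    pthread_t tid[16];
    int started = 0;

    if (nthreads < 1 || nthreads > 16)
        return -1;

    pthread_mutex_lock(&mpi_lock);
    calls_made = 0;
    max_inside = 0;
    pthread_mutex_unlock(&mpi_lock);

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &iters) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return calls_made;
}

/* Largest number of threads ever observed inside the library at once. */
int max_concurrency(void)
{
    return max_inside;
}
```

Every MPI function the application uses would need a wrapper of this
kind, all sharing the same lock. With an MPI library that supports
threads, the equivalent guarantee is what you promise when you
request MPI_THREAD_SERIALIZED through MPI_Init_thread: several
threads may call MPI, but never two at the same time.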

   george.

On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote:

> Hello folks,
>
> I am trying to write some fault-tolerance systems with the
> following criteria:
> 1) Recover any software/hardware crashes
> 2) Dynamically Shrink and grow.
> 3) Migrate processes among machines.
>
> Does anyone have examples of code? What MPI platform is recommended
> to accomplish such requirements?
>
> I am using three MPI platforms and each has its own issues:
> 1) MPICH2 - good multi-threading support, but bad fault-tolerance
> mechanisms.
> 2) OpenMPI - Does not support multi-threading properly and cannot
> have it trap exceptions yet.
> 3) FT-MPI - Old and does not support multi-threading at all.
>
> Any suggestions?
> --
>
> Regards,
> Mohammad Huwaidi
>
> We can't resolve problems by using the same kind of thinking we used
> when we created them.
> --Albert Einstein