On Fri, 2010-10-22 at 07:36 -0600, Ralph Castain wrote:
> MPI won't do this - if a node dies, the entire MPI job is terminated.
> Take a look at OpenRCM, a subproject of Open MPI:
> This is designed to do what you describe as we have a similar (open
> source) project underway at Cisco. If I were writing your system, I would:
> (a) add my sensors to the orte/mca/sensor framework. You'll find that
> we already monitor memory usage, for example. Use the orte/mca/db
> framework to store your data in a database. Several different
> databases are already supported, though it is easy to add another if
> you want (e.g., sqlite support).
> (b) add my desired error response to the src/orte/mca/errmgr/orcm
> module. The ability to migrate processes is already implemented, but
> you may need to do something additional to migrate a VM. If you
> prefer, you can create your own module in that area and use one of the
> other components as an example.
> Then let orcm start its daemons across your nodes. Orcm daemons will
> do the monitoring and reporting for you, and will start and monitor
> the virtual machines. If you set the max local restarts to 0, and max
> global restarts to some number, the system will automatically migrate
> any failures to other nodes.
> See the June 2010 presentation under "Publications" on the web page
> above for an overview of how it all works. If you decide to go this
> route, I'll be happy to provide advice and further explanation. And of
> course, you are welcome to participate in ORCM if you choose.
Thank you very much. I think this will be very useful for me. Could you
send me a link to the presentation? (I can't see it under
And could you send me a very simple example of how to use ORCM? (Maybe I
can get some useful information by reading
Does ORCM have man pages for its functions, as Open MPI does?
Vasiliy G Tolstov <v.tolstov_at_[hidden]>