MPI won't do this - if a node dies, the entire MPI job is terminated.
Take a look at OpenRCM, a subproject of Open MPI:
It is designed to do what you describe, since we have a similar (open source) project underway at Cisco. If I were writing your system, I would:
(a) add my sensors to the orte/mca/sensor framework. You'll find that we already monitor memory usage, for example. Use the orte/mca/db framework to store your data in a database. Several different databases are already supported, though it is easy to add another if you want (e.g., sqlite support).
(b) add my desired error response to the src/orte/mca/errmgr/orcm module. The ability to migrate processes is already implemented, but you may need to do something additional to migrate a VM. If you prefer, you can create your own module in that area and use one of the other components as an example.
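As a rough illustration of the sample-then-store pattern in (a), here is a minimal standalone sketch. It is not the orte/mca/sensor or orte/mca/db API: the table layout, the function names (open_store, record, latest), and the node/VM/metric names are all hypothetical, and sqlite (via Python's standard sqlite3 module) is used only because it needs no server.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the sample database. Hypothetical schema, not ORCM's."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL, node TEXT, vm TEXT, metric TEXT, value REAL)"
    )
    return conn


def record(conn, node, vm, readings, ts=None):
    """Store one row per metric in `readings`; return the number of rows written."""
    ts = time.time() if ts is None else ts
    rows = [(ts, node, vm, metric, float(value)) for metric, value in readings.items()]
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def latest(conn, vm, metric):
    """Return the most recent value of `metric` for `vm`, or None if never sampled."""
    row = conn.execute(
        "SELECT value FROM samples WHERE vm = ? AND metric = ? "
        "ORDER BY ts DESC LIMIT 1",
        (vm, metric),
    ).fetchone()
    return None if row is None else row[0]
```

A per-node daemon would call record() once per sampling interval for each VM, and whatever makes the migration decision would read back through latest() or a similar query.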
Then let orcm start its daemons across your nodes. The orcm daemons will do the monitoring and reporting for you, and will start and monitor the virtual machines. If you set the max local restarts to 0 and the max global restarts to some number, the system will automatically migrate anything that fails to another node.
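To make the restart settings concrete, here is a small sketch of the decision as described above. It is one reading of that description, not the errmgr code: the function name, the counters, and the action strings are hypothetical, and the real module may handle more cases.

```python
def next_action(local_restarts, global_restarts, max_local, max_global):
    """Decide what to do with a failed process or VM (illustrative policy only).

    local_restarts / global_restarts: how many times it has already been
    restarted on the same node / relocated to a different node.
    """
    if local_restarts < max_local:
        return "restart_local"  # try again on the same node
    if global_restarts < max_global:
        return "migrate"        # relocate to another node
    return "abort"              # both limits exhausted


# With max local restarts set to 0, the first failure goes straight to migration.
first_failure = next_action(local_restarts=0, global_restarts=0, max_local=0, max_global=3)
```

Setting the local limit to 0 skips the same-node retry entirely, which is why a failure is moved to another node right away instead of being restarted in place.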
See the June 2010 presentation under "Publications" on the web page above for an overview of how it all works. If you decide to go this route, I'll be happy to provide advice and further explanation. And of course, you are welcome to participate in ORCM if you choose.
On Oct 22, 2010, at 6:09 AM, Vasiliy G Tolstov wrote:
> On Fri, 2010-10-22 at 14:07 +0200, Reuti wrote:
>> On 22.10.2010 at 10:58, Vasiliy G Tolstov wrote:
>>> Hello. Maybe this question has already been answered, but I can't find it on the list.
>>> I'm running about 60 Xen nodes with about 7-20 virtual machines on
>>> each. I want to gather disk, CPU, memory, and network utilisation from the virtual
>>> machines and put it into a database for later processing.
>>> As I see it, my architecture is like this: one or two master servers with an MPI
>>> process with rank 0 that can insert data into the database. These master
>>> servers spawn an MPI process on each Xen node that gathers statistics from
>>> the virtual machines on that node and sends them to the masters (maybe with
>>> a multicast request). On each virtual machine I have an (MPI) process that
>>> can get data and send it to the MPI process on its Xen node. Virtual machines
>>> can migrate to other Xen nodes....
>> Do you just want to monitor the physical and virtual machines with an application running under MPI? It sounds like that could be done with Ganglia or Nagios.
> No. I want to get real-time data to decide which virtual machine I need
> to migrate to another Xen node because it needs more resources.
> Vasiliy G Tolstov <v.tolstov_at_[hidden]>