Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] dinamic spawn process on remote node
From: Ralph Castain (rhc_at_[hidden])
Date: 2010-10-22 09:36:23

MPI won't do this - if a node dies, the entire MPI job is terminated.

Take a look at OpenRCM, a subproject of Open MPI:

It is designed to do what you describe; we have a similar (open source) project underway at Cisco. If I were writing your system, I would:

(a) add my sensors to the orte/mca/sensor framework. You'll find that we already monitor memory usage, for example. Use the orte/mca/db framework to store your data in a database. Several different databases are already supported, though it is easy to add another if you want (e.g., sqlite support).

(b) add my desired error response to the src/orte/mca/errmgr/orcm module. The ability to migrate processes is already implemented, but you may need to do something additional to migrate a VM. If you prefer, you can create your own module in that area and use one of the other components as an example.
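
To make the data flow in (a) concrete, here is a minimal standalone sketch of "take a sample, store it in a database", using sqlite since that was mentioned as a possible backend. It does not use the orte/mca/sensor or orte/mca/db APIs; the table layout and the names open_store, record_sample and latest are hypothetical and chosen only for illustration.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a sample store. The schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL, node TEXT, vm TEXT, metric TEXT, value REAL)"
    )
    return conn


def record_sample(conn, node, vm, metric, value, ts=None):
    """Store one measurement, e.g. memory usage of one VM on one Xen node."""
    if ts is None:
        ts = time.time()
    conn.execute(
        "INSERT INTO samples VALUES (?, ?, ?, ?, ?)",
        (ts, node, vm, metric, value),
    )
    conn.commit()


def latest(conn, vm, metric):
    """Return the most recent value of a metric for a VM, or None if there is none."""
    row = conn.execute(
        "SELECT value FROM samples WHERE vm = ? AND metric = ? "
        "ORDER BY ts DESC LIMIT 1",
        (vm, metric),
    ).fetchone()
    return row[0] if row else None
```

A monitoring daemon on each node would call record_sample periodically; whatever decides on migration would then query the store rather than talk to the nodes directly.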

Then let orcm start its daemons across your nodes. The orcm daemons will do the monitoring and reporting for you, and will start and monitor the virtual machines. If you set the max local restarts to 0 and the max global restarts to some number, the system will not try to restart a failed process on the same node; it will automatically restart it on another node instead.
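
The effect of the two restart limits can be sketched as follows. This is only an illustration of the policy as described above, not ORCM code: ProcState and next_placement are hypothetical names, and details such as resetting the local counter after a move or simply picking the first other node are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class ProcState:
    node: str                 # node the process is currently placed on
    local_restarts: int = 0   # restarts already done on the current node
    global_restarts: int = 0  # restarts already done on a different node


def next_placement(state, nodes, max_local, max_global):
    """Return the node to restart a failed process on, or None to give up."""
    # Try the same node first, as long as the local limit allows it.
    if state.local_restarts < max_local:
        state.local_restarts += 1
        return state.node
    # Otherwise move to another node, as long as the global limit allows it.
    if state.global_restarts < max_global:
        candidates = [n for n in nodes if n != state.node]
        if not candidates:
            return None
        state.global_restarts += 1
        state.local_restarts = 0   # assumption: local count restarts on a new node
        state.node = candidates[0]  # assumption: naive choice of target node
        return state.node
    return None
```

With max_local=0 the first branch is never taken, so every failure leads to a move to another node until max_global is exhausted.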

See the June 2010 presentation under "Publications" on the web page above for an overview of how it all works. If you decide to go this route, I'll be happy to provide advice and further explanation. And of course, you are welcome to participate in ORCM if you choose.


On Oct 22, 2010, at 6:09 AM, Vasiliy G Tolstov wrote:

> On Fri, 2010-10-22 at 14:07 +0200, Reuti wrote:
>> Hi,
>> On 22.10.2010 at 10:58, Vasiliy G Tolstov wrote:
>>> Hello. Maybe this question has already been answered, but I can't find it
>>> in the list archive.
>>> I'm running about 60 Xen nodes, each with about 7-20 virtual machines
>>> on it. I want to gather disk, CPU, memory, and network utilisation from the
>>> virtual machines and store it in a database for later processing.
>>> The architecture I have in mind is this: one or two master servers run the
>>> MPI process with rank 0, which can insert data into the database. These
>>> master servers spawn an MPI process on each Xen node that gathers statistics
>>> from the virtual machines on that node and sends them to the masters (maybe
>>> with a multicast request). On each virtual machine I have an (MPI) process
>>> that can get data and send it to the MPI process on its Xen node. A
>>> virtual machine can migrate to another Xen node....
>> Do you just want to monitor the physical and virtual machines with an application running under MPI? It sounds like that could be done with Ganglia or Nagios instead.
> No. I want real-time data so I can decide which virtual machine I need
> to migrate to another Xen node because it needs more resources.
> --
> Vasiliy G Tolstov <v.tolstov_at_[hidden]>
> Selfip.Ru