Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Help with some fundamentals
From: Olivier SANNIER (Olivier.SANNIER_at_[hidden])
Date: 2011-01-21 09:58:38


-----Original Message-----
From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]] On Behalf Of Nico Mittenzwey
Sent: Thursday, January 20, 2011 18:58
To: Open MPI Users
Subject: Re: [OMPI users] Help with some fundamentals

On 01/20/2011 05:50 PM, Olivier SANNIER wrote:
> What is the behavior in case a node dies or becomes unreachable?
> Your run will be aborted. However there is checkpoint/restart support
> for Linux http://www.open-mpi.org/faq/?category=ft
>
> As this is a Win32 program, I'll have to take into account that there is only the <abort> behavior.
AFAIK yes
> So there is no dynamic discovery of the nodes available on the network, unless, of course, I were to write a tool that does it before the actual run is started.
This is done by a batch system like PBS (TORQUE) or SGE

> Is there a monitoring tool that would give me indications of the status and health of the nodes?
> This has nothing to do with MPI. Nagios or Ganglia can do that.
>
> I was more thinking of a tool that would tell me a node is already performing a task, so that I can avoid having it oversubscribed.
This is also done by a batch system
> I've started looking at Beowulf clusters, and that led me to PBS. Am I right in assuming that PBS (PBSPro or TORQUE) could be used to do the monitoring and the load balancing I thought of?
Yes, however the terms "monitoring" and "load balancing" are usually used in other contexts.

Thank you for your help. I now have a better understanding of the technical details involved in all this.
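
For reference, the pre-run discovery tool mentioned above could look roughly like the sketch below. It is only an illustration, not something proposed in this thread: the function names are made up, and probing TCP port 22 is an assumption that will not hold on nodes without an SSH server (replace it with whatever service the nodes actually expose). The only Open MPI specific part is the "hostname slots=N" hostfile line format.

```python
import socket


def is_reachable(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 is an assumption; pick a port that is known to be open on the nodes.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def build_hostfile(candidates, probe=is_reachable):
    """Return hostfile text listing only the hosts that answer the probe.

    candidates maps a host name to its slot count (hypothetical values).
    """
    lines = [
        f"{host} slots={slots}"
        for host, slots in candidates.items()
        if probe(host)
    ]
    return "\n".join(lines) + ("\n" if lines else "")
```

The returned text can be written to a file such as hosts.txt and passed to mpirun with --hostfile hosts.txt. This only filters out unreachable nodes before the run starts; it does not tell whether a node is already busy, which is what the batch system is for.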