First of all, thank you for answers.

I have a bit more questions, added below.

 

What is the behavior in case a node dies or becomes unreachable?

Your run will be aborted. However there is checkpoint/restart support for Linux http://www.open-mpi.org/faq/?category=ft

 

As this is a Win32 program, I’ll have to take into account that there is only the « abort » behavior.

 

What makes any given machine become a node available for tasks?

You define it in a host file or a batch system tells it OpenMPI.

 

So there is no dynamic discovery of nodes available on the network. Unless, of course, if I was to write a tool that would do it before the actual run is started.



Is there a monitoring tool that would give me indications of the status and health of the nodes?

This has nothing to do with MPI. Nagios or Ganglia can do that.

 

I was more thinking of a tool that would tell me a node is already performing a task, so that I can avoid having it oversubscribed.



I’m quite sure all these are trivial questions for those with more experience, but I’m having a hard time finding resources that would answer those.

Read an introduction on programming with MPI and another one on Beowulf clusters (batch systems, monitoring, shared file systems). This should give you enough information on the topic. If you don't mind spending more money on software you can also take a look at Microsofts HPC Server.

I’ve started looking at beowulf clusters, and that lead me to PBS. Am I right in assuming that PBS (PBSPro or TORQUE) could be used to do the monitoring and the load balancing I thought of?

 

Thanks

Olivier