I am currently working on a Win32 program that performs intensive calculations and is already multithreaded. As a result, it uses all the available cores on the PC it runs on.
The basic behavior is that the user opens a model and clicks the “start” button; the threads are then spawned, and once they have all finished, control is given back to the user.
While this works well, we have found that for larger models the computation time is limited by the number of cores: tasks that could run in parallel are still waiting in the pool when every core is already busy.
As a result, we are investigating the possibility of using grid computing to multiply the number of available cores.
This, of course, has technical challenges, and reading documentation on various websites led me to the Open MPI site and to this list.
I’m not sure this is the appropriate place to ask my questions; if it is not, please tell me what a more appropriate place might be.
I understand that MPI is a framework that would facilitate the communication between the user’s computer and the nodes that perform the distributed tasks.
What I have a hard time grasping is the following:
What communication layer is used? How do I choose it?
What is the behavior in case a node dies or becomes unreachable?
What makes any given machine become a node available for tasks?
Is there some sort of load balancing?
Is there a monitoring tool that would give me indications of the status and health of the nodes?
How does the “MPI enabled” code get transferred to the nodes? If I understand things correctly, I would have to write a separate command-line exe that takes care of the tasks, and this would be the exe that gets sent over to each node.
I’m quite sure all of these are trivial questions for those with more experience, but I’m having a hard time finding resources that answer them.
Thanks in advance for your help