On May 16, 2010, at 5:33 PM, Fernando Lemos wrote:
> On Sun, Apr 18, 2010 at 11:04 PM, Ralph Castain <RCASTAIN_at_[hidden]> wrote:
>> Hello all
>> Please feel free to ask questions and/or contribute ideas. Anyone interested in contributing to the effort is welcome!
> What's not entirely clear to me yet after reading the documentation
> and taking a brief look at the source code is how OpenRCM integrates
> with OpenMPI. Is MPI supported at all in trunk? How would I compile
> and run an MPI application in this environment? Should I compile my
> app with orcmcc or mpicc? Should I run it with orcm-start, mpirun or
ORCM doesn't care what it runs. Think of it as being like Torque or slurm.
We really don't have an OMPI-ORCM integration yet. What I need to do is write the ORTE plm module to support ORCM. It's on my to-do list, but it probably won't happen for a while.
> I think that some documentation on how this integration happens (if it
> happens at all at this time) would be very helpful. For example, does
> orcm-start offer mpirun an environment that OpenMPI uses to launch the
> MPI processes (Slurm-like), or does orcm-start launch the processes
> directly (e.g. orted + your app, something like mpirun already does)
> or something else.
When we get things done the way we expect, you would use mpirun to start your job - it would tell ORCM what to run and where to run it.
orcm-start and orcm-stop are just temporary tools for development - they will eventually disappear.
> Also, I believe the code still does not support restarting
> applications from checkpoints yet, is this assumption correct?
Yes. I believe you will see better integration between OMPI's checkpoint/restart and ORCM as time goes on - especially once Josh finishes writing his dissertation and can (hopefully) spend some time on it.
We haven't really defined the OMPI vs ORCM boundaries just yet, which is why the interface isn't clear. What has been happening so far is that error recovery procedures and other capabilities developed in ORCM wind up migrating over into ORTE so that mpirun can use them directly. So, for example, virtually all of the recovery capability is now available in ORTE and usable from mpirun itself.
Which raises the question: what exactly does ORCM do? Right now, it looks like ORCM will serve the role of a more scalable Torque or slurm (both in terms of launch and wireup times), with the ability to monitor system state-of-health, predict problems, and reconfigure the system on the fly to avoid failures.
Over time, I expect the respective roles will clarify. I know the documentation leaves something to be desired, but this is one reason why - hard to document a moving target :-/
We are having an ORCM project meeting this week to discuss some of these issues, and to start some students on developing a few ORCM capabilities. So I think this summer will see some major progress on clarifying these questions.
> Thanks for your time,